Catalyst is an MVC framework for Perl web application development. It’s similar in concept to Ruby on Rails (for Ruby) and Apache Struts (for Java). It’s been on my TODO list for nearly a year. I started using it this week and am very impressed.

Here’s a good introductory article on what Catalyst is and an example application (using Ajax! so sexy!).

Other useful Catalyst links:

I call it “cattle-lust” just for fun. There’s a certain mad cow in every web developer.

The State of Search

Always searching for the latest in search engine development.

Some interesting work lately on CPAN.

I had not seen KinoSearch before; a very interesting “loose port” of Lucene written in C and Perl. I need to try it out.

Search::QueryParser also looks promising. I was just thinking that such a thing would be helpful for SWISH::HiLiter and/or HTML::HiLiter … or maybe even Swish-e itself.

Search::ContextGraph is older, one of Maciej Ceglowski’s projects from back before he became a full-time painter and world traveller.

Search::Estraier is Dobrica Pavlinusic’s pure Perl implementation of the Hyper Estraier Perl API. I know Dobrica’s a big fan of Hyper Estraier, even over Swish-e and Xapian.

Search::FreeText appears to be abandoned, or at least not actively maintained; it was last updated in 2003. Too bad, because the documentation makes it look interesting. It does use DB_File, though, which I know from experience with Perlfect does not scale well beyond about 20K documents.

Search::Xapian has recently been updated. There’s lots of activity on the Xapian project. It’s at or near the top of candidates for the Swish3 backends.

Search::InvertedIndex is one I had not seen before, but it looks very interesting. I have been thinking about a SwishQL, a SQL backend for Swish3, and Search::InvertedIndex offers a MySQL backend. Benjamin Franz wrote it; he also wrote CGI::Minimal, which I used for a while with CrayDoc (contributed a patch too, iirc). I’ll have to come back to this one.

Search::Indexer. Another one I’d not heard of, but which bears investigation. Author Laurent Dami is a familiar name to me, as he also wrote Search::QueryParser (above) and a handy FormBuilder TT patch I’ve been using.
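As a rough illustration of the SQL-backed inverted index idea mentioned above, here is a minimal sketch in Python. It is not the schema Search::InvertedIndex uses and not a design for SwishQL; the table name `postings`, the function names, and the use of an in-memory SQLite database in place of MySQL are all assumptions made to keep the example self-contained.

```python
import sqlite3


def build_index(docs):
    """Store one (term, doc_id, freq) row per distinct term per document."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, freq INTEGER)")
    db.execute("CREATE INDEX postings_term ON postings (term)")
    for doc_id, text in enumerate(docs):
        counts = {}
        for word in text.lower().split():  # naive whitespace tokenizer
            counts[word] = counts.get(word, 0) + 1
        db.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(term, doc_id, freq) for term, freq in counts.items()],
        )
    return db


def search(db, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = sorted(set(query.lower().split()))
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    # Each document has at most one row per term, so a document matches all
    # terms exactly when its matching row count equals the number of terms.
    sql = (
        "SELECT doc_id FROM postings WHERE term IN (%s) "
        "GROUP BY doc_id HAVING COUNT(*) = ? "
        "ORDER BY SUM(freq) DESC, doc_id" % placeholders
    )
    return [row[0] for row in db.execute(sql, terms + [len(terms)])]
```

Calling `search(build_index(docs), "swish search")` returns the ids of the documents that contain both words, most frequent matches first. The appeal of the SQL approach is that the database handles storage and lookups; the cost is that ranking has to be expressed in (or layered on top of) SQL.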


Bill pointed me at Xapian as a potential direction for a better Swish-e. I like what I’ve seen so far. Xapian is a C++ library for probabilistic information retrieval, supports UTF-8 encoding, and provides lots of language bindings via SWIG. Nice. I’ll post more as I play more.

SWISHED example

I’ve worked with Josh Rabinowitz to put together a demo of his SWISHED server.

The demo is hosted here at peknet. You can read all about it here:

Our idea for this demo was to illustrate how you might use SWISHED, SWISH::API::Remote and HTML::HiLiter. Those are all Perl projects, the first two written by Josh and the last written by me.


I’ve been putting together a history of CrayDoc for a presentation to the local MidWest XML users’ group.

Turns out the product goes back over 10 years. I inherited it in 2001, when I started in my current job at Cray. There was a time when Cray was owned by SGI, and during that time the documentation server was not known as CrayDoc; it was Dynaweb, a third-party product. But starting in 2002, with the shipment of CrayDoc 1.0 (though I guess that versioning is erroneous, now that I know the history), CrayDoc is again a home-grown product. Versions 1 and 2 were 100% Perl, though now we use the SWISH-E search engine, which is written in C.

My work on CrayDoc has been a real education in CGI programming, HTML, databases, and code design. When the presentation is done, I’ll post it here for posterity.

UPDATE 11/22/2004: posted here (PDF).

SWISH-E and ranking algorithms

I’ve been actively making noise on the swish-e discussion list for over a year now. It’s a great open source indexing and searching tool. Love it. Loooove it. How’s that for Geek Love?

Part of the power of swish-e (the product is UPPERCASE, the command is lower, and I’m a lazy typist…) is in the libxml2 parser from the GNOME project. That thing flies. I’ve since started using the libxml2 tools in my other work as well.

Part of my work with swish-e has been in improving the ranking algorithm. I found a wealth of info on that subject, thanks in part to the success of Google, which makes it easy to find information about what makes Google work so well. How’s that for the tail wagging the dog? Or something like that.

Anyway, this has led me down the road of natural language query and methods of relevance ranking. Pretty dense stuff. My wee brain starts to twist and shudder. But I found this a good start and this even more helpful.

I have an email in to the developer about the open source status of the NITLE Semantic Engine, which looks like a really interesting idea. The author wrote this article about vector ranking, which I found very lucid.
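To make the vector ranking idea a little more concrete, here is a minimal sketch in Python. It is not swish-e's ranking code and not the NITLE Semantic Engine; it assumes the textbook approach of turning the query and each document into tf-idf weighted term vectors and ranking documents by cosine similarity. All function names and the whitespace tokenizer are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real indexer would do much more."""
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def vectorize(text, idf):
    """Map text to a sparse {term: tf * idf} vector; unknown terms are dropped."""
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(d, idf)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: -s[0])
```

Because cosine similarity normalizes by vector length, a short document that is mostly about the query terms can outrank a long one that merely mentions them, which is one reason this family of methods is attractive for relevance ranking.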

HTML Highlighting

I can’t count the hours I’ve spent hacking at a foolproof highlighter for HTML. But I’m nearing a really good approximation of foolproof. I’ve posted HTML::HiLiter to the Perl CPAN.

The really hard thing about this was creating a regular expression that is fast enough to be useful but accurate enough to work 99% of the time. I ended up using the HTML::Parser module, which is ‘fast enough’ and very powerful, due to the embedded C code and some good design. I’ve also looked at HTML::Tree, but because HTML::Parser was a standard module in Perl 5.6.x, it makes more sense to me right now to use a widespread standard. It increases the chance that folks might find HTML::HiLiter useful.
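The general approach (let a parser separate markup from text, then apply the regular expression only to the text) can be sketched as follows. This is an illustration in Python using the standard library's `html.parser`, not HTML::HiLiter's code and not the HTML::Parser API; the class name, the `hilite` CSS class, and the `highlight` helper are assumptions for the example.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Wrap query terms found in text nodes in <span class="hilite">."""

    SKIP = {"script", "style"}  # never touch text inside these elements

    def __init__(self, terms):
        if not terms:
            raise ValueError("need at least one term to highlight")
        # convert_charrefs=False so entities pass through untouched
        super().__init__(convert_charrefs=False)
        pattern = "|".join(re.escape(t) for t in terms)
        self.regex = re.compile(r"\b(%s)\b" % pattern, re.IGNORECASE)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, as written

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Only text nodes are matched, so tag names and attributes are safe.
        if not self.skip_depth:
            data = self.regex.sub(r'<span class="hilite">\1</span>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def highlight(html, terms):
    """Return html with each term highlighted wherever it appears as text."""
    parser = Highlighter(terms)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `highlight('<p class="perl">I love Perl</p>', ["perl"])` wraps the word in the paragraph text but leaves the `class="perl"` attribute alone. The sketch does not handle a match that is split across tags (such as `Pe<b>rl</b>`) or multi-word phrases that span inline markup, which is a large part of what makes a robust highlighter hard.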

The most recent version (0.11) is due to get posted soon. I’m excited about it: I’ve improved the speed and accuracy, and added several features to help support my other recent project: SWISH::HiLiter — an extension to the SWISH::API class.

Both of these projects are open source and come out of my Cray work on CrayDoc. A huge project for me, and a real learning experience: character encodings, HTML syntax, and the power of Perl regular expressions. I’d wager that my Perl skills increased 500% as a result of this project.

If you use it, let me know what you think.