an eddy in the bitstream

Day: October 29, 2004

SWISH-E and ranking algorithms

I’ve been actively making noise on the swish-e discussion list for over a year now. It’s a great open source indexing and searching tool. Love it. Loooove it. How’s that for Geek Love?

Part of the power of swish-e (the product is UPPERCASE, the command is lower, and I’m a lazy typist…) is in the libxml2 parser from the GNOME project. That thing flies. I’ve since started using the libxml2 tools in my other work as well.

Part of my work with swish-e has been in improving the ranking algorithm. I found a wealth of info on that subject, thanks in part to the success of Google, which makes it easy to find information about what makes Google work so well. How’s that for the tail wagging the dog? Or something like that.

Anyway, this has led me down the road of natural language query and methods of relevance ranking. Pretty dense stuff. My wee brain starts to twist and shudder. But I found this a good start and this even more helpful.

I have an email in to the developer about the open source status of the NITLE Semantic Engine, which looks like a really interesting idea. The author wrote this article about vector ranking, which I found very lucid.
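For anyone who wants to see the vector idea in miniature, here is a rough Python sketch. It is not how swish-e ranks results and it is not taken from the article mentioned above. It is just the textbook version: treat the query and each document as a bag of term counts, then score each document by the cosine of the angle between the two vectors. The function names (`tf_vector`, `cosine`, `rank`) and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Turn text into a term-frequency vector (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, doc) pairs, best match first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "swish-e indexes html",
        "perl regular expressions",
        "ranking in swish-e",
    ]
    for score, doc in rank("swish-e ranking", docs):
        print(f"{score:.3f}  {doc}")
```

A real engine would presumably add weighting (rarer terms counting for more), stemming, and an index so it doesn't rescan every document, but the geometry is the same.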

MOTU for sale

UPDATE: It sold. Good luck to the next owner; make lots of good recordings.

Shameless consumerism. I’m selling the audio system used to record several projects, including the Brett Larson debut record, and several House of Mercy recordings (with Peter Rasmussen at the helm). I’ve listed it on eBay but you can find pics here.

HTML Highlighting

I can’t count the hours I’ve spent hacking at a foolproof highlighter for HTML. But I’m nearing a really good approximation of foolproof. I’ve posted HTML::HiLiter to the Perl CPAN.

The really hard thing about this was creating a regular expression that is fast enough to be useful but accurate enough to work 99% of the time. I ended up using the HTML::Parser module, which is ‘fast enough’ and very powerful, due to the embedded C code and some good design. I’ve also looked at HTML::Tree but because HTML::Parser was a standard module in Perl 5.6.x it makes more sense to me right now to use a widespread standard. It increases the chance that folks might find HTML::HiLiter useful.
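To show why a parser helps, here is a small sketch of the general approach: let the parser split the document into tags and text, and only run the regular expression over the text. It is written in Python with the standard `html.parser` module purely for illustration. It is not the HTML::HiLiter code or its API, and the `Highlighter` class, the `highlight` function, and the choice of a `<mark>` wrapper are all assumptions of mine.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Re-emit HTML, wrapping query matches in text nodes only."""

    def __init__(self, terms):
        # convert_charrefs=False keeps entities like &amp; as separate events,
        # so they can be passed through untouched.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(
            "|".join(re.escape(t) for t in terms), re.IGNORECASE
        )
        self.out = []
        self.in_raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_raw = True
        # get_starttag_text() returns the tag as written, attributes included.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_raw = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_raw:
            self.out.append(data)  # never touch script or style bodies
        else:
            self.out.append(
                self.pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", data)
            )

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def highlight(html, terms):
    """Return html with each term wrapped in <mark>, leaving markup alone."""
    parser = Highlighter(terms)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(highlight('<p class="perl">Perl is <b>perl</b></p>', ["perl"]))
```

Notice that the `class="perl"` attribute is left alone, which is exactly where a regex run over the raw markup tends to go wrong. The sketch still misses a phrase that is split across tags (say, `regular <b>expressions</b>`), which is one of the cases that makes "foolproof" so hard.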

The most recent version (0.11) is due to get posted soon. I’m excited about it: I’ve improved the speed and accuracy, and added several features to help support my other recent project: SWISH::HiLiter — an extension to the SWISH::API class.

Both these projects are open source and come out of my Cray work on CrayDoc. A huge project for me, and a real learning experience: character encodings, HTML syntax, and the power of Perl regular expressions. I’d wager that my Perl skills increased 500% as a result of this project.

If you use it, let me know what you think.

© 2024 peknet