Archives: October 2004

SWISH-E and ranking algorithms

I’ve been actively making noise on the swish-e discussion list for over a year now. It’s a great open source indexing and searching tool. Love it. Loooove it. How’s that for Geek Love?

Part of the power of swish-e (the product is UPPERCASE, the command is lower, and I’m a lazy typist…) is in the libxml2 parser from the GNOME project. That thing flies. I’ve since started using the libxml2 tools in my other work as well.

Part of my work with swish-e has been in improving the ranking algorithm. I found a wealth of info on that subject, thanks in part to the success of Google, which makes it easy to find information about what makes Google work so well. How’s that for the tail wagging the dog? Or something like that.

Anyway, this has led me down the road of natural language query and methods of relevance ranking. Pretty dense stuff. My wee brain starts to twist and shudder. But I found this a good start and this even more helpful.

I have an email in to the developer about the open source status of the NITLE Semantic Engine, which looks like a really interesting idea. The author wrote this article about vector ranking, which I found very lucid.
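For anyone wondering what vector ranking boils down to, here is a minimal sketch in Python. It is not the swish-e ranking code and not the NITLE engine, just the textbook idea as I understand it: treat the query and each document as a vector of term counts and sort documents by the cosine of the angle between them. The function names (`tokenize`, `cosine`, `rank`) and the whitespace tokenizer are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def cosine(a, b):
    # a and b are Counters mapping term -> count.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query, docs):
    # docs maps a document name to its text.
    # Returns (score, name) pairs, best match first.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), name)
              for name, text in docs.items()]
    return sorted(scored, reverse=True)
```

Real engines usually weight terms (rare words count for more than common ones) rather than using raw counts, but the sort-by-angle idea is the same.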

MOTU for sale

UPDATE: It sold. Good luck to the next owner; make lots of good recordings.

Shameless consumerism. I’m selling the audio system used to record several projects, including the Brett Larson debut record, and several House of Mercy recordings (with Peter Rasmussen at the helm). I’ve listed it on eBay but you can find pics here.

HTML Highlighting

I can’t count the hours I’ve spent hacking at a foolproof highlighter for HTML. But I’m nearing a really good approximation of foolproof. I’ve posted HTML::HiLiter to the Perl CPAN.

The really hard thing about this was creating a regular expression that is fast enough to be useful but accurate enough to work 99% of the time. I ended up using the HTML::Parser module, which is ‘fast enough’ and very powerful, due to the embedded C code and some good design. I’ve also looked at HTML::Tree but because HTML::Parser was a standard module in Perl 5.6.x it makes more sense to me right now to use a widespread standard. It increases the chance that folks might find HTML::HiLiter useful.
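To show why a parser helps, here is a rough sketch of the parser-based approach in Python, using the standard library's `html.parser`. It is not the HTML::HiLiter code, and the class name `Highlighter`, the `highlight` function, and the `hilite` CSS class are all made up for the example. The idea is simply to let the parser separate tags from text, pass tags through untouched, and run the regular expression only on text.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    def __init__(self, term):
        # Keep entities as separate events so they pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b(%s)\b" % re.escape(term), re.IGNORECASE)
        self.out = []
        self.skip = 0  # nonzero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip:
            self.out.append(data)  # never touch script or style content
        else:
            self.out.append(
                self.pattern.sub(r'<span class="hilite">\1</span>', data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def highlight(html, term):
    h = Highlighter(term)
    h.feed(html)
    h.close()
    return "".join(h.out)
```

A bare regular expression run over the raw markup would happily "highlight" a match inside an attribute value; this sketch leaves `class="perl"` alone and only wraps the word in the visible text. It still misses a term that is split across tags, which is one of the genuinely hard cases.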

The most recent version (0.11) is due to get posted soon. I’m excited about it: I’ve improved the speed and accuracy, and added several features to help support my other recent project: SWISH::HiLiter — an extension to the SWISH::API class.

Both these projects are open source and come out of my Cray work on CrayDoc. A huge project for me, and a real learning experience: character encodings, HTML syntax, and the power of Perl regular expressions. I’d wager that my Perl skills increased 500% as a result of this project.

If you use it, let me know what you think.

perldoc 5.8.1

Just added the Perl docs for v 5.8.1 to the docs/ section. This is, of course, my favorite programming language… Now it’s searchable via the main search tool. No more waiting if is down.

glibc docs

I added the latest glibc docs to the docs section. Mostly because I needed a quick searchable reference as I teach myself C. Of course, I found out afterwards that glibc is not supported on Mac OS X, so it proved kind of moot. But at least the reference is handy and it was a good exercise in usability. I’ll probably use that method again.

new format

I’m playing with the blosxom plugin architecture. It’s pretty geek-cool (should I refer to that as GC?). So now I have an _intro file that always sorts to the top of my blog, but is just a regular blog file like the others. Ah, how I amuse myself.

I also moved the footer of the main page to the end of the blog instead of keeping it in a persistent frame. A frame seems to take up too much permanent real estate otherwise.

From the Holy Mountain

I just finished a report for my MLIS program on William Dalrymple’s excellent book, From the Holy Mountain: A Journey among the Christians of the Middle East.

The report involved a survey of local libraries with an eye toward whether their collections would support the writing of a particular section of Dalrymple’s book. The gist is (surprise!) that our academic libraries are a better bet than our public libraries.

But don’t let that scare you off. This book is mesmerizing and funny, tragic and involving. Get a copy at your local (public) library.