Archives: March 2008

Swish3 Status 30 March 2008

More progress with Swish3.

  • There is now a swish_xapian.cpp C++ example for using libswish3 with a Xapian backend. Only the indexing portion is complete; the search part is still TODO. Still, it is significant that building a search engine was so easy.
  • The swish.xml header format is complete, and the library can now read and write the header file. That part still needs to be added to the swish_xapian.cpp example.
  • Squashed some long-standing memory leaks when using the filehandle functions.

Little by little.

Swish3 Status 2008-03-15

I’ve finally gotten back to Swish3 development after several months away. Hard to believe I’ve been working on this project for something close to 3 years now.

Lately I have been focusing on the following things:

Header file format
Because Swish3 will have multiple IR backends, it is important that there be a consistent index metadata file that describes the MetaNames, Properties, and tokenizing information, just like the Swish2 header does. Just as with the config file format, it makes sense to define the header file format as XML, since we already have a robust XML parser for free. To make it simple, I have defined the header file XML schema to be the same as the config file schema. In short, you configure Swish3 by creating a header file. The “real” header file will be more strict about explicitly naming all the expected attribute values, numbering the MetaName/Property ids, etc. But the idea is simple: a single XML schema.

I have written the code to read header/config file format and create a swish_Config object. There’s also code for merging 2 swish_Config objects together, so that you can define a config file to override an existing header file.
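
To make the override idea concrete, here is a minimal sketch in Python (not the libswish3 C API). It assumes a simple two-level XML layout. The element and attribute names (swish, MetaNames, PropertyNames, id, bias) and the functions read_config and merge_configs are invented for illustration; they are not the actual Swish3 schema or the swish_Config functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical header and config documents sharing one schema.
# The tag and attribute names are assumptions, not the real Swish3 format.
HEADER = """
<swish>
  <MetaNames><title id="0" bias="1"/><author id="1"/></MetaNames>
  <PropertyNames><title id="0"/></PropertyNames>
</swish>
"""
OVERRIDE = """
<swish>
  <MetaNames><title bias="10"/><keywords/></MetaNames>
</swish>
"""

def read_config(xml_text):
    """Parse section/item elements into {section: {item: {attr: value}}}."""
    root = ET.fromstring(xml_text)
    config = {}
    for section in root:
        entries = config.setdefault(section.tag, {})
        for item in section:
            entries[item.tag] = dict(item.attrib)
    return config

def merge_configs(base, override):
    """Return a new config in which values from override win over base."""
    merged = {
        section: {name: dict(attrs) for name, attrs in entries.items()}
        for section, entries in base.items()
    }
    for section, entries in override.items():
        target = merged.setdefault(section, {})
        for name, attrs in entries.items():
            target.setdefault(name, {}).update(attrs)
    return merged

base = read_config(HEADER)
merged = merge_configs(base, read_config(OVERRIDE))
```

In this sketch the merge copies the base before applying overrides, so the original header data stays untouched and the same header can be combined with different config files.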

Still TODO is the code for writing the header file.

The Perl bindings have been updated to reflect the new swish_Config API. This required a great deal of reworking and thinking about the Perl API. I had to rewrite things a few times to get a workable solution. The key Perl mantra is “objects on demand.” I.e., do not define any Perl objects that wrap C pointers and try to keep them on the XS side. Instead, create all Perl objects “just in time” as part of the get_* method call. This makes reference counting much simpler.

MetaNames and PropertyNames
These now have their own C API with swish_MetaName and swish_Property structs. These relate directly to the header file format and swish_Config. There will end up being a separate PropertyName API for search results. I still think we’re going to have to port the Swish2 PropertyNames storage/retrieval code to Swish3 and have a backend-independent index.prop file. The issue with this is going to be scaling. One other thought I’ve had is storing properties in a SQLite db. That route won’t allow for presorted properties, but does have the advantage of being much more transparent and debuggable.
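
As a rough illustration of the SQLite idea, here is a minimal sketch using Python's standard sqlite3 module. The table layout and the function names (open_property_store, store_properties, fetch_properties, doc_ids_sorted_by) are assumptions made up for this example, not an existing Swish3 design. Sorting happens at query time with ORDER BY, which reflects the lack of presorted properties mentioned above.

```python
import sqlite3

def open_property_store(path=":memory:"):
    """Open (or create) a property database with one row per doc/property."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS properties ("
        " doc_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (doc_id, name))"
    )
    return conn

def store_properties(conn, doc_id, props):
    """Insert or replace all properties of one document in one transaction."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO properties (doc_id, name, value)"
            " VALUES (?, ?, ?)",
            [(doc_id, name, value) for name, value in props.items()],
        )

def fetch_properties(conn, doc_id):
    """Return {property name: value} for a single document."""
    rows = conn.execute(
        "SELECT name, value FROM properties WHERE doc_id = ?", (doc_id,)
    )
    return dict(rows.fetchall())

def doc_ids_sorted_by(conn, name):
    """Return document ids ordered by one property, sorted at query time."""
    rows = conn.execute(
        "SELECT doc_id FROM properties WHERE name = ? ORDER BY value, doc_id",
        (name,),
    )
    return [row[0] for row in rows]
```

The transparency benefit is that the resulting file can be inspected with the stock sqlite3 command-line shell, without any Swish-specific tooling.
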
SWISH::Prog
I have moved the SWISH::Prog svn tree to svn.swish-e.org from peknet.com. I also moved SWISH::Filter (and likely will eventually move SWISH::API::More and its cousins).

SWISH::Prog will form the framework for the Perl implementation of Swish3. I know there are some folks who don’t like the idea of Swish3 being so Perl-centric. To that I can say only, tough luck. 🙂

Seriously though, my perspective is that there will be multiple Swish3 implementations. The one I am working on is in Perl using SWISH::Prog. There’s nothing to stop someone from implementing one in C or C++ or Java or whatever. libswish3 provides the parsing/tokenizing piece missing from other IR projects, and it is a library for the very reason that implementing a Swish3 program should be language-neutral. If you can link against a C library, then you can write a Swish3 program. The header file API is well documented; the backend is supposed to be pluggable. It’s all about the API.

I do intend to write a swish_xapian.cpp program eventually, showing how to implement a C++ Swish3 program with Xapian. That could be the fallback program if you really don’t want to use Perl.

Documentation
I’ve started a swish_intro.7 and swish_migration.7 set of docs. swish_intro will outline the aggregator/normalizer/analyzer/indexer/searcher philosophy and the libswish3 API. swish_migration will discuss the differences between Swish2 and Swish3 and how you can convert your config files and move to using Swish3.