Archives: September 2006

10 Guidelines for Swish-e 3.0 Development

Thoughts on the Swish-e project (http://swish-e.org/).

Understand why folks like Swish-e 2

Fast. Easy to configure. Flexible. Keep it that way.

Understand why folks crave Swish-e 3

Folks like Fast, Easy and Flexible. They want to bring those qualities to bear on more difficult challenges.

* I18n demands multi-byte charset support. UTF-8 is the accepted standard. Swish-e 2 is stuck with single-byte charsets.

* Data sets are huge these days. Swish-e 2 doesn’t scale well past a few million documents.

* Huge data sets mean lots of time spent indexing. Stable incremental index support (add, update, delete) is a must. Swish-e 2 has incremental support but it is buggy and the code is opaque.

* It’s a polylingual world. Swish-e lacks bindings for modern scripting languages beyond Perl.
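To make the first point concrete: in UTF-8 a single character can take anywhere from one to four bytes, so code that assumes one byte per character miscounts. Here is a minimal sketch, assuming valid UTF-8 input. The helper `utf8_char_count` is a made-up name for illustration, not anything in Swish-e.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Swish-e): count the code points in a
 * NUL-terminated string that is ASSUMED to be valid UTF-8.
 * Continuation bytes always look like 10xxxxxx, so every byte that does
 * not match that pattern starts a new code point. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the word "naïve" encoded as UTF-8, `strlen` would report 6 bytes while this helper reports 5 characters, because the "ï" occupies two bytes.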

C == Fast but C == Slow

C (or another compiled language) provides the best speed. The core code base for Swish-e is all in C. C is the *lingua franca* of the open source world.

But C is harder to write than most scripting languages, and C code takes much longer to develop and debug (and thus maintain). So projects that involve C attract fewer developers from the community. Having fewer developers means (on the whole) that development is slower. A couple of good C developers can turn out good code quickly, but as with all OSS, maintenance and legacy become big issues: community (people) issues that become software issues. See the maintenance issue below.

Modular == cool. Monolithic != cool.

Swish-e 2 revolves around the swish-e command line tool, a monolithic tool that parses, indexes and searches. A good step has been taken with libswish-e for splitting out search into its own library. Let’s continue that direction by splitting up the parser, indexer and searcher into separate, modular components (libraries). That increases Flexibility (while probably impacting Fast).
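As a rough illustration of that split, here is a toy sketch in C. Every name in it (`toy_parse`, `toy_index_add`, `toy_search`, `word_handler`) is hypothetical and has nothing to do with the real Swish-e or libswish-e API; the only point is that the parser talks to the indexer through a callback, and the searcher only reads the index.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_POSTINGS 256
#define MAX_WORD 32

/* Storage owned by the indexer component (hypothetical layout). */
typedef struct {
    char word[MAX_WORD];
    int doc_id;
} posting;

typedef struct {
    posting postings[MAX_POSTINGS];
    size_t count;
} toy_index;

/* Callback type that keeps the parser ignorant of any particular indexer. */
typedef void (*word_handler)(const char *word, int doc_id, void *ctx);

/* Parser: split text into lowercased alphanumeric words and pass each one
 * to the handler. Words longer than MAX_WORD - 1 bytes are truncated. */
void toy_parse(const char *text, int doc_id, word_handler handler, void *ctx)
{
    char word[MAX_WORD];
    size_t len = 0;

    assert(text != NULL && handler != NULL);
    for (;; text++) {
        unsigned char c = (unsigned char)*text;
        if (isalnum(c)) {
            if (len < MAX_WORD - 1)
                word[len++] = (char)tolower(c);
        } else {
            if (len > 0) {          /* a word just ended */
                word[len] = '\0';
                handler(word, doc_id, ctx);
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
}

/* Indexer: a word_handler that appends postings until the index is full. */
void toy_index_add(const char *word, int doc_id, void *ctx)
{
    toy_index *idx = (toy_index *)ctx;

    if (idx->count < MAX_POSTINGS) {
        posting *p = &idx->postings[idx->count++];
        strncpy(p->word, word, MAX_WORD - 1);
        p->word[MAX_WORD - 1] = '\0';
        p->doc_id = doc_id;
    }
}

/* Searcher: read-only; returns how many postings match the word exactly. */
size_t toy_search(const toy_index *idx, const char *word)
{
    size_t hits = 0;

    for (size_t i = 0; i < idx->count; i++) {
        if (strcmp(idx->postings[i].word, word) == 0)
            hits++;
    }
    return hits;
}
```

A caller would zero a `toy_index`, run `toy_parse(text, doc_id, toy_index_add, &idx)` for each document, and then query with `toy_search`. Swapping in a different indexer means passing a different handler, and the parser never needs to change.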

Maintainable code is a feature

What happens when you get hit by a bus? Or get bored? Or move on? Or change careers and take up that basket weaving profession you’ve always secretly craved? Who will maintain your code? Did you document what you wrote? Did you check in your latest changes? Are your comments clear? Don’t fall into the trap of thinking that the code is the documentation. Swish-e is more than the development team *du jour*; folks will keep using it after you’re gone. OSS projects all suffer from this problem: check out the orphaned projects on sf.net.

Community is a feature

Getting folks involved is one of the joys (and struggles) of OSS projects. Making it easy for folks to get involved, whether contributing documentation, tests, patches or good beer, is a good way to keep the fun in your own involvement.

Since writing C is a skill that fewer people have, encourage folks to write tests (using the TAP format), documentation, and how-tos.
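For anyone who hasn’t met it, TAP (the Test Anything Protocol) is just lines of text: a plan line such as `1..2`, followed by one `ok N - description` or `not ok N - description` line per test. Here is a minimal sketch of emitting it from C. The helpers `tap_line` and `tap_report` are hypothetical names for illustration, not an existing Swish-e test harness.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format one TAP result line into buf.
 * Returns whatever snprintf returns (the length it wanted to write). */
int tap_line(char *buf, size_t size, int number, int passed,
             const char *description)
{
    assert(buf != NULL && description != NULL);
    return snprintf(buf, size, "%s %d - %s",
                    passed ? "ok" : "not ok", number, description);
}

/* Hypothetical helper: print the plan line, then one line per result. */
void tap_report(const int *results, const char *const *descriptions, int n)
{
    char line[128];

    printf("1..%d\n", n);
    for (int i = 0; i < n; i++) {
        tap_line(line, sizeof line, i + 1, results[i], descriptions[i]);
        puts(line);
    }
}
```

Because the output is plain text, any TAP consumer (Perl’s `prove`, for example) can read results from a test written in any language, which is what makes it a low barrier for new contributors.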

And consider how much code actually needs to be written in C. Many of the strong parts of Swish-e are actually Perl scripts that support and supplement the core C program.

Don’t reinvent the wheel

There are lots of search tools out there. Information retrieval is a hot subject right now. Swish-e 2 has some cool features. It’s Fast, Easy and Flexible. But it doesn’t do everything folks want it to.

However, other projects are strong where Swish-e is weak. There is quality open source IR code out there that does UTF-8, incremental indexing and good scaling. Good programmers are lazy. Let’s use other folks’ code to get the features we want.

Play to your strengths

Those other IR projects might be weak where Swish-e 2 is strong. They might be Slow, Hard or Inflexible in key areas. Let’s figure out what makes Swish-e Fast, Easy and Flexible and concentrate on making those parts of the code easy to integrate with the quality pieces from those other IR projects. Remember: modular is cool.

This is supposed to be Fun

Remember?

Swish3 Proposal

Thoughts on Swish-e version 3.

Assumptions

* In order to keep Swish-e fast and portable, some key parts need to be written in a compiled language like C.

* C developers are increasingly hard to recruit to OSS projects like Swish-e.

* C is slower to develop in and more difficult to maintain than scripting languages like Python or Perl.

* To encourage more code contributors to the project and make the project more useful to more people, make the core C parts library modules with well-defined and documented APIs. This makes the code more maintainable and flexible, and allows integration of other IR libraries like Xapian.

Core C Libraries

*NOTE: The following list is no longer accurate. libswish3 combines all these into one library.*

SwishUtils (libswishu): Common shared functions for things like IO, string handling, times, errors, memory and hashing.

I’ve started this one.

SwishConfig (libswishc): Parse config files into in-memory data structures, and read/write index config headers.

I’ve started this one.

SwishParser (libswishp): Parse documents into properties and wordlists.

I’ve started this one.

SwishIndex (libswishi): Store properties and wordlists.

TODO.

SwishSearch (libswishs): Parse queries and fetch results from an index.

Could be a reworking of the existing libswish-e to expect UTF-8 (which SwishUtils supports).

Envelopes

Bought a box of envelopes at the store this evening, in order to mail an overdue bill. I realized it had been over 8 years since I last bought envelopes. That’s a sign of the times: most bills come with a self-addressed envelope, and all my other correspondence happens electronically.