I have moved my last 2 years’ work over to the SVN repos at swish-e.org. Pretty significant for me, making the whole thing public now.
At this point, I’m working under the following assumptions:
Tokenizing
Swish3 needs a UTF-8 aware tokenizer. This really means either a regexp library or a full UTF-8 character library. A regexp library is far easier with regard to user configuration, easier even than Swish-e’s *Characters directives.
I have looked at standalone C libraries to accomplish this. The best standalone bet is PCRE, but building it for Windows is a little tricky. Swish-e offers optional PCRE support, so there is a precedent.
For the time being, however, I am going to build using Perl, which already has full UTF-8 regexp support. Why be Compatible when you can have the Real Thing?
Mostly I am using Perl because (a) I know it, (b) it’s ideal for things like search and indexing, and (c) I want to include a backend for KinoSearch.
There is nothing in the current libswish3 implementation that precludes using a different tokenizing scheme in the future. For my purposes, Perl is just fine.
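To make the regexp approach concrete, here is a minimal sketch of a UTF-8 aware tokenizer where the pattern is the only thing a user has to configure. It is written in Python purely for illustration and is not libswish3 or SWISH::3 code; the names `WORD_PATTERN` and `tokenize`, and the choice to lowercase tokens, are assumptions of this sketch.

```python
import re

# Hypothetical default: a token is a run of Unicode word characters.
# Swapping in a different pattern changes what counts as a token.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(raw_bytes, pattern=WORD_PATTERN):
    """Decode UTF-8 bytes and return the lowercased tokens matched by pattern."""
    text = raw_bytes.decode("utf-8")
    return [match.group(0).lower() for match in pattern.finditer(text)]


if __name__ == "__main__":
    # "Caf\u00e9 Z\u00fcrich" is "Café Zürich" with precomposed accented letters.
    sample = "Caf\u00e9 Z\u00fcrich, full-text search".encode("utf-8")
    print(tokenize(sample))
    # Same input, but keep hyphenated words together by changing only the pattern.
    print(tokenize(sample, re.compile(r"[\w-]+")))
```

The point of the sketch is that the accented letters are handled as whole characters rather than as bytes, and that changing tokenizing behaviour comes down to changing one pattern rather than maintaining lists of characters.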
Swish3
I have set up Perl bindings for libswish3’s configuration and parser functions in the SWISH::3 namespace. The plan is to write a Perl OO library to complement/wrap libswish3’s C functions. That Perl library can in turn be used to write applications. I’ll be implementing a “swish3” command line app in Perl using SWISH::Prog and distributing it as part of SWISH::Prog.
TODO
There’s still a long todo list. But there is progress, and the shape of what is to come.