Marvin’s got some good remarks on Perl’s UTF-8 regexp support vis-à-vis tokenizing strings. His remarks are timely, as I have been spending/wasting time lately in libswish3’s C tokenizing functions. My goal was to replace them with Perl regexp matching, but that may have been premature given Marvin’s remarks.
Multibyte encoding support is one of the big Swish3 features. I had what I thought was a workable framework for it using the C99 standard “wchar_t” wide character functions. But I’ve been disillusioned (which is usually a Good Thing) the last couple days based on some reading I’ve been doing on the linux-utf8 list archives.
Namely, this thread (link) burst my bubble. But in a Good Way.
Seems “wchar_t” is not portable. Particularly on Windows (link), where it is defined as 16 bits rather than the full 32 bits needed to represent any Unicode character.
So I’ve been googling for other C libraries out there to help me. I need basically two kinds of functions:
“utf8_tolower( xmlChar * mixedCaseStr )” All the metanames and propertynames need to be normalized against parsed tagnames, so I need to have this. Also, all strings are normalized to lowercase before tokenizing. That’s just an IR Good Practice.
“tokenize_utf8_string( xmlChar * utf8_string )” Split up a string into *words*. This really should be language-aware, but at the very least it needs to recognize what’s alpha vs whitespace vs punctuation, etc.
Both types of functions are crucial to the kind of string wrangling Swish3 needs to do. Along the way, it would be nice to build in a portable UTF-8-aware regular expression library, since that would make for a nice flexible way of configuring WordTokenPattern rather than needing to write a whole C function to tokenize a string into words. This kind of regexp support is provided in Swish-e via the general C regexp library, which I believe isn’t UTF-8 aware, or optionally via PCRE (Perl Compatible Regular Expressions), which (Dave tells me) isn’t developed for Win32 anymore.
So it would be nice, though not crucial, to get UTF-8-happy regex support if I can as well.
Google says many things about UTF-8 and i18n, but what it doesn’t tell me is what I should do. 🙂
Here are some things I’ve found:
link UTF-8 regexp library. Actually supports lots of different encodings, which I don’t really need.
link The ICU from IBM. Big blue and formidable. Way More than I need. Not going to use this one.
link Makes “wchar_t” portable. Kind of. But the author is the one who disillusioned me in that mail thread I mention above, and the library doesn’t appear to be supported any more. So I’m not going down that road.
link Could be a starting point if I need to roll my own. Does UTF-8 to UCS-4 conversion, so doesn’t depend on “wchar_t”. Has some UTF-8 functions, including a tolower().
link As Bill reminded me, libxml2 has unicode stuff in it too. Not exactly what I need, but could be a starting place. And it fits with my increasing sense of using libxml2 to do everything.
What’s in a Word? Word tokenization is the big issue. Swish-e tokenizes through a 256-entry lookup table: it’s perfect for 8-bit encodings because it is fast and easy to understand. But the Hawker Observation applies: you need to seriously rethink the algorithm you’re using every time you increase your data set by two orders of magnitude. (link)
That’s why a regexp library or something else with predefined Unicode character tables is necessary. It’s a wheel that’s been invented. What I’m struggling with tonight is *which* wheel to use: what’s easy to implement, well supported and proven, and will be something I can use now and trust a year from now.
I have moved my last 2 years’ work over to the SVN repos at swish-e.org. Pretty significant for me, making the whole thing public now.
At this point, I’m working under the following assumptions:
Tokenizing Swish3 needs a UTF-8 aware tokenizer. This really means either a regexp library or a full UTF-8 character library. A regexp library is far easier with regard to user configuration, easier even than Swish-e’s *Characters directives.
I have looked at standalone C libraries to accomplish this. The best standalone bet is PCRE, but building that for Windows is a little tricky. Swish-e offers optional PCRE support, so there is a precedent.
For the time being, however, I am going to build using Perl, which already has full UTF-8 regexp support. Why be Compatible when you can have the Real Thing?
Mostly I am using Perl because (a) I know it, (b) it’s ideal for things like search and indexing, and (c) I want to include a backend for KinoSearch.
There is nothing in the current libswish3 implementation that precludes using a different tokenizing scheme in the future. For my purposes, Perl is just fine.
Swish3 I have set up Perl bindings for libswish3’s configuration and parser functions in the SWISH::3 namespace. The plan is to write a Perl OO library to complement/wrap libswish3’s C functions. That Perl library can in turn be used to write applications. I’ll be implementing a “swish3” command line app in Perl using SWISH::Prog and distributing it as part of SWISH::Prog.
TODO There’s still a long todo list. But there is progress, and the shape of what is to come.