Multibyte encoding support is one of the big Swish3 features. I had what I thought was a workable framework for it using the C99 standard “wchar_t” wide character functions. But I’ve been disillusioned (which is usually a Good Thing) the last couple of days based on some reading I’ve been doing in the linux-utf8 list archives.
Namely, this thread (link) burst my bubble. But in a Good Way.
Seems “wchar_t” is not portable. Particularly on Windows (link), where it is defined as 16 bits rather than the 32 bits needed to hold any Unicode code point.
So I’ve been googling for other C libraries out there to help me. I need basically two kinds of functions:
“utf8_tolower( xmlChar * mixedCaseStr )” All the metanames and propertynames need to be normalized against parsed tagnames, so I need to have this. Also, all strings are normalized to lowercase before tokenizing. That’s just an IR Good Practice.
“tokenize_utf8_string( xmlChar * utf8_string )” Split up a string into *words*. This really should be language-aware, but at the very least it needs to recognize what’s alpha vs whitespace vs punctuation, etc.
Both types of functions are crucial to the kind of string wrangling Swish3 needs to do. Along the way, it would be nice to build in a portable UTF-8-aware regular expression library, since that would make for a nice flexible way of configuring WordTokenPattern rather than needing to write a whole C function to tokenize a string into words. In Swish-e this kind of regexp support is provided via the general C regexp library, which I believe isn’t UTF-8 aware, or optionally via PCRE (Perl Compatible Regular Expressions), which (Dave tells me) isn’t developed for Win32 anymore.
So it would be nice, though not crucial, to get UTF-8-happy regex support if I can as well.
Google says many things about UTF-8 and i18n, but what it doesn’t tell me is what I should do. 🙂
Here are some things I’ve found:
link UTF-8 regexp library. Actually supports lots of different encodings, which I don’t really need.
link The ICU from IBM. Big blue and formidable. Way More than I need. Not going to use this one.
link Makes “wchar_t” portable. Kind of. But the author is the one who disillusioned me in that mail thread I mention above. So I’m not going down that road anymore.
This library doesn’t appear to be supported any more.
link Could be a starting point if I need to roll my own. Does UTF-8 to UCS-4 conversion, so doesn’t depend on “wchar_t”. Has some UTF-8 functions, including a tolower().
link As Bill reminded me, libxml2 has unicode stuff in it too. Not exactly what I need, but could be a starting place. And it fits with my increasing sense of using libxml2 to do everything.
What’s in a Word? Word tokenization is the big issue. Swish-e tokenizes through a 256-byte lookup table: it’s perfect for 8-bit encodings because it is fast and easy to understand. But the Hawker Observation applies: you need to seriously rethink the algorithm you’re using every time you increase your data set by two orders of magnitude. (link)
That’s why a regexp library or something else with predefined Unicode character tables is necessary. It’s a wheel that’s been invented. What I’m struggling with tonight is *which* wheel to use: what’s easy to implement, well supported and proven, and will be something I can use now and trust a year from now.