peknet :: an eddy in the bit stream         
about peknet
peknet is Peter E Karman musing on technology, politics, religion, books, beer and parenthood.

© 2005 peknet dot com


Dezi 0.2.4 released

Happy to announce release of Dezi 0.2.4 with spelling suggestion feature.

File under projects/swish Wed Oct 17 22:42:28 CT 2012

Python Dezi Client

Finally got back to the Python Dezi client I started several months ago. All tests passing now so I'm calling this the 0.001000 release.

File under projects/swish Fri Aug 24 13:46:37 CT 2012

Dezi, now with Xapian backend

Read all about it.

File under projects/swish Mon Apr 30 21:39:13 CT 2012

Dezi search platform

This week I announced the initial release of Dezi, a new search platform based on Swish3, Apache Lucy, OpenSearch and Plack.

As of about 15 minutes ago, there are now PHP and Perl clients available.

File under projects/swish Sat Oct 1 21:47:46 CT 2011

libswish3 1.0.0 released

I am happy to announce the 1.0.0 release of libswish3:

http://swish-e.org/swish3/libswish3-1.0.0.tar.gz

libswish3 is at the core of multiple Swish3 implementations, and has reached a stable enough API that a 1.0.0 release seems appropriate.

From the README:

libswish3 is a document parser compatible with the Swish-e 2.4 -S prog API. libswish3 is a C library for parsing documents into a data structure that can then be stored and searched with a variety of IR backends.


There are currently four implementations of Swish3 available:
  • swish_xapian (C++ using libxapian, included in libswish3 distribution)
  • SWISH::Prog::Xapian (Perl using Search::Xapian)
  • SWISH::Prog::Lucy (Perl using Apache Lucy)
  • SWISH::Prog::KSx (Perl using KinoSearch)


All the Perl implementations are available from CPAN. They each rely on SWISH::3 (the Perl bindings to libswish3) and the core SWISH::Prog project, a Perl rewrite of the swish-e 2.x C binary and accompanying helper scripts. The SWISH::Prog distribution includes a 'swish3' command line interface with options very similar to the swish-e 2.x command line tool.

Xapian, KinoSearch and Apache Lucy all offer robust UTF-8 and incremental indexing support, as well as the ability to scale to many millions of documents across multiple servers.

You can read more about Swish3 at the devel site.

UPDATE: Mailing list announcement here.

File under projects/swish Wed Sep 21 22:03:59 CT 2011

Search::OpenSearch::Server with REST API

Just uploaded several modules to CPAN that together implement a full REST API for KinoSearch indexes, using Search::OpenSearch::Server::Plack.

% curl -XPOST http://localhost:5000/foo \
    -d '<doc><title>bar</title>foo</doc>' \
    -H 'Content-Type: application/xml'

[response:]
{
    "success":1,
    "doc":{
        "orgs":[],
        "places":[],
        "people":[],
        "topics":[],
        "summary":"",
        "title":"bar",
        "author":[]
    },
    "total":"21581",
    "code":"200"
}
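
For anyone who would rather not shell out to curl, the same POST can be sketched in Python with nothing but the standard library. This is only a sketch: build_doc_request and send are names I made up for illustration (they are not part of any Search::OpenSearch or Dezi client), and the URL and payload are simply the ones from the curl example. Nothing goes over the wire until send() is called.

```python
import json
import urllib.request


def build_doc_request(base_url, uri, xml_body):
    """Build (but do not send) a POST that adds one XML document.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{uri.lstrip('/')}",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_doc_request(
    "http://localhost:5000", "foo", "<doc><title>bar</title>foo</doc>"
)
# With a server listening on localhost:5000:
#   result = send(req)
#   result["doc"]["title"]
```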


The modules are:
  • Search::OpenSearch 0.11
  • Search::OpenSearch::Server 0.05
  • Search::OpenSearch::Engine::KSx 0.08
  • SWISH::Prog::KSx 0.17
  • SWISH::Prog 0.49


File under projects/swish Thu May 26 13:56:43 CT 2011

    CPAN test failures

    SWISH::3 0.08_04 is passing all tests all over the CPAN testers universe, so that is encouraging.

    However, some reports (notably on FreeBSD) report false failures because of a Wstat issue.

    I've posted about it at PerlMonks and hope someone out there has an easy fix.

    Update: finally found a fix for this. The problem is that Perl has its own my_setenv() function that interferes with the native setenv() called by libswish3.c. The fix was to set the magic Perl var PL_use_safe_putenv as shown here. This took many hours and much googling to track down. Glad to be done with it (I hope!).

    File under projects/swish Mon Oct 11 00:37:31 CT 2010

    Swish3 progress report

    There's been a ton of work on Swish3 in the last year. I've actually started planning a 1.0 release, after 5 years of work.

    Lately I've been focusing on three things: (1) making the Perl bindings easier to install; (2) indexing of compressed documents; and (3) supporting XInclude of document fragments. The first is accomplished: you can install the entire library via CPAN. The last two are aimed at large doc sets where I want to keep the XML compressed on disk for space reasons, and where I want to re-use subsets of the document collections in building multiple indexes.

    File under projects/swish Tue Jun 8 23:33:18 CT 2010

    Frozen Perl 2010

    It's been a long week, culminating today in Frozen Perl 2010, a Perl conference for and by Perl hackers, here in the Twin Cities. I gave two talks at today's conference, one on Swish3 and the other on Devel::NYTProf and Search::Tools. Both talks seemed well-received.

    In the process of preparing the talks I also released a few new, related modules to CPAN this week:
    Search::OpenSearch
    OpenSearch server glue for KinoSearch and Swish-e 2.x via SWISH::Prog. There's a demo Plack app and ExtJS, using both search engines as part of the slides for my Swish3 talk.

    I think OpenSearch is very cool and look forward to doing more with that spec, including adding more features (e.g. facets) to Search::OpenSearch.
    Search::Query
    Search::Query now has support for SQL and SWISH Dialects. I hope to add KinoSearch and Xapian dialects soon. The Search::Query::Parser now has (undocumented and experimental) support for range queries, so that you can say:
    foo=( 1..4 )
    and that'll be expanded to
    foo=( 1 OR 2 OR 3 OR 4 )
    when the Dialect query object is stringified. Handy for things like ranges of dates, which is how I am using it at $work.
    Search::Tools, SWISH::API::*
    New releases of these older modules as well, with some bug fixes and refactoring to support the Search::Query.
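
The range expansion described under Search::Query above is easy to picture with a few lines of code. What follows is a rough Python sketch of the idea, not the Search::Query::Parser implementation: expand_ranges and the regular expression are my own illustration and only handle integer ranges written as (lo..hi).

```python
import re

# Matches "( 1..4 )" style integer ranges; whitespace inside the parens is optional.
RANGE = re.compile(r"\(\s*(-?\d+)\s*\.\.\s*(-?\d+)\s*\)")


def expand_ranges(query):
    """Rewrite every (lo..hi) range in the query as an explicit OR list."""

    def _expand(match):
        lo, hi = int(match.group(1)), int(match.group(2))
        if lo > hi:
            raise ValueError(f"empty range: {match.group(0)}")
        return "( " + " OR ".join(str(n) for n in range(lo, hi + 1)) + " )"

    return RANGE.sub(_expand, query)
```

Calling expand_ranges("foo=( 1..4 )") gives back the expanded form shown above. A wide range obviously produces a long OR list, which is one reason to treat this as a convenience rather than a real range query.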
    So, yes. A busy week.

    I enjoyed hearing other folks' talks today at Frozen Perl. There was a good variety: pack/unpack, Unicode, i18n and best practice-related presentations. I met some new people, renewed friendships with folks I already knew, and drank lots of free coffee. The cookies were good too.

    File under projects/swish Sat Feb 6 23:27:19 CT 2010

    SWISH::Prog::KSx and SWISH::Prog::Xapian on CPAN

    Uploaded first pass at both implementations this last week. The announcement to the Swish-e list just went out.

    File under projects/swish Mon Nov 30 22:19:26 CT 2009

    SWISH::3 on CPAN

    After 4 years of learning how to glue Perl and C together with XS and many sleepless nights, I have released SWISH::3 to the CPAN.

    <cue the sound of scattered applause>

    Mostly this is a triumph of longevity rather than quality code. It's taken me this long to get something workable.

    File under projects/swish Fri Nov 20 22:02:44 CT 2009

    swish_xapian

    The Xapian backend for Swish3 has been getting some love lately. The swish_xapian command line tool has most of the features now that swish-e v2.x does.

    I've posted about it on the Swish-e wiki.

    File under projects/swish Wed Nov 18 23:22:11 CT 2009

    Building Swish3 on OS X 10.6

    I just wasted many hours trying to figure out why libswish3 failed to pass all tests on 10.6.

    This link explains what I figured out to be true the hard way:
    10.6 is now a mainsream 64bit OS !

    10.5 a 64bit capable 32bit OS !


    If I force a 32-bit compile, all is well:
    CFLAGS="-m32 -O2 -g" ./configure && make test


    While I would like to figure out how to compile as a native 64bit app, my MacBook has too many libs from before the 10.5 to 10.6 upgrade to trust that all the dep chain is 64bit compat.

    The error I was seeing was the noxious BAD_ADDRESS error which traced back to some libxml2 hash features. Red herring. Of course, I had to recompile libxml2 with the -m32 as well so that everything was 32bit compatible. Took me hours before I noticed that the older working version on the same box was about half the size of the new version... which triggered the ol' 32-vs-64-bit thing in my brain.
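
In hindsight, checking the word size of each library up front would have saved time. The file command does this properly; as a rough sketch of what it looks at, the first four bytes of a Mach-O file say whether it is 32-bit, 64-bit or a universal binary. The function names here are made up for illustration.

```python
# Mach-O magic numbers as they appear in the first four bytes of a file,
# in both byte orders. 0xcafebabe marks a universal (fat) binary; it is
# also the Java class-file magic, so this only makes sense for libraries.
MACHO_MAGIC = {
    b"\xfe\xed\xfa\xce": "32-bit",
    b"\xce\xfa\xed\xfe": "32-bit",
    b"\xfe\xed\xfa\xcf": "64-bit",
    b"\xcf\xfa\xed\xfe": "64-bit",
    b"\xca\xfe\xba\xbe": "universal (fat)",
}


def macho_word_size(header):
    """Classify a file by the first four bytes of its header."""
    return MACHO_MAGIC.get(bytes(header[:4]), "not Mach-O")


def library_word_size(path):
    """Read just enough of the file at path to classify it."""
    with open(path, "rb") as fh:
        return macho_word_size(fh.read(4))
```

Running library_word_size over every library in the dependency chain would show at a glance which ones predate the upgrade.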

    Update: In the end this was a bug in libswish3 with confusing naming of some variables. But the 64-bit thing was a Good Thing To Realize.

    File under projects/swish Mon Oct 5 23:05:50 CT 2009

    Roy Tennant Interview

    Roy resurrected the Swish-e project nearly 10 years ago. There's a nice interview with him out recently, in which he says "there really isn't anything I can't do with Perl and my favorite indexing tool, Swish-e" -- which is exactly my experience too.

    Thanks, Roy.

    File under projects/swish Sat Mar 28 21:14:16 CT 2009

    Xapian, KinoSearch

    I'm starting in (again) on KS and Xapian backends for SWISH::Prog, using libswish3.

    I just downloaded the latest Xapian release (1.0.10) and latest KS from svn. I started the build process for Xapian first, then svn up'd KS and started the KS build. The KS build finished, and all tests passed, before Xapian even finished its build. That could be due to the speed of the gcc vs g++ compilers. Dunno. It was just a startling thing to notice.

    Now back to writing code...

    File under projects/swish Tue Dec 30 20:15:45 CT 2008

    Swish3 Status 19 Sept 2008

    A long hiatus for a full summer and then some contract work.

    Some benchmarks of the latest tokenization algorithm show that pure-ASCII tokenization is about 20% faster and UTF-8 tokenization is about 2% slower. So I'll take it.

    The benchmark used perl/docmaker.pl to generate 100 random "docs" in each encoding (ASCII and random UTF-8), then timed swish_lint on them with and without the -t option.

    Also recently fixed some failing tests on Linux and a memory warning.

    File under projects/swish Fri Sep 19 05:33:31 CT 2008

    Tests

    Swish3 has a lot more tests than Swish-e (not counting Josh Rabinowitz's excellent testing package). And as of tonight all Swish3 tests are passing under both Linux 2.6 (CentOS 5) and OS X 10.4. \o/

    File under projects/swish Thu Sep 18 22:54:29 CT 2008

    Years pass, code grows, expectations ... ?

    It's the OSCON time of year again. A couple of years ago I was in a hotel in Portland OR, coding up a KinoSearch implementation of Swish3. That was then. Two years later, Swish3 is really no further on, except that it has been largely refactored and is a much more stable, yet alpha, project.

    Just remembering that tonight as I update the Perl bindings of Swish3 to use the new tokenization routines. It's the kind of project that moves in fits and starts. I would have liked to have finished it years ago, but it would not be as good, since many of the refactorings I have made over the last couple of years have been the direct result of coding strategies I have learned in my $job(s) and other FOSS projects.

    And I find I like my life this way, chipping away at building a better mousetrap while I sip New Belgium beer and listen to Robert Plant and Alison Krauss. Sure, it's a sultry July evening, and my entire life occasionally floats through my mind like a grainy day at the beach, but hey: you only get this chance once, and building a better mousetrap is not a bad way to pass some idle moments.

    File under projects/swish Thu Jul 24 21:46:09 CT 2008

    Swish3 Status 19 April 2008

    There's been quite a bit of activity in the last month.
    • The C++ Xapian example now can search as well as index, and there are Perl equivalents using Search::Xapian checked into svn as well. The C++ code will read/write the swish.xml header; the Perl does not (yet).
    • The meta/prop id unique check now uses a hash for quick look up.
    • The test suite for libswish3 is totally restructured. Now using Perl's Test::Harness and added a slew of new meta/prop tests. Alongside that were additions to the NamedBuffer debugging output to print each substring in the buffer.
    • Several new string-related utility functions for converting ints to strings and back. Also a new config hash for configuration options that use a StringList instead of a simple string.
    • Fixed some mem leaks in the example .c programs and added more info to the swish_lint usage() output (including reminders about the various SWISH_DEBUG* env var values).
    There are still several parser features yet to be implemented to support the Swish-e 2.4 config options, but those will likely take a backseat to getting a working swish3 Perl script running with SWISH::Prog and SWISH::Prog::Xapian.
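
The hash-based unique check mentioned in the list above boils down to something like the following. This is a Python sketch of the idea only, not the libswish3 C code: NameRegistry is a made-up name, and lowercasing the names before comparing is my assumption.

```python
class NameRegistry:
    """Track MetaName/Property names and ids, rejecting duplicates.

    Both checks are dictionary lookups, so registering a name does not
    require scanning everything registered so far.
    """

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = {}

    def register(self, name, ident):
        key = name.lower()  # assumption: names compare case-insensitively
        if key in self._id_by_name:
            raise ValueError(f"duplicate name: {name}")
        if ident in self._name_by_id:
            raise ValueError(f"duplicate id: {ident}")
        self._id_by_name[key] = ident
        self._name_by_id[ident] = key

    def id_for(self, name):
        return self._id_by_name[name.lower()]
```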

    File under projects/swish Sat Apr 19 22:42:14 CT 2008

    Swish3 Status 30 March 2008

    More progress with Swish3.
    • There is now a swish_xapian.cpp C++ example for using libswish3 with a Xapian backend. All that is complete is the indexing portion; still TODO is the search part. Still, a significant thing that it was so easy to build a search engine.
    • The swish.xml header format is complete, and the code can now read/write the header file. Need to add that part to the swish_xapian.cpp example.
    • Squashed some long-standing memory leaks when using the filehandle functions.
    Little by little.

    File under projects/swish Sun Mar 30 22:03:48 CT 2008

    Swish3 Status 2008-03-15

    I've finally gotten back to Swish3 development after several months away. Hard to believe I've been working on this project for something close to 3 years now.

    Lately I have been focusing on the following things:
    Header file format
    Because Swish3 will have multiple IR backends, it is important that there be a consistent index metadata file that describes the MetaNames, Properties, and tokenizing information, just like the Swish2 header does. Just as with the config file format, it makes sense to define the header file format as XML, since we already have a robust XML parser for free. To make it simple, I have defined the header file XML schema to be the same as the config file schema. In short, you configure Swish3 by creating a header file. The "real" header file will be more strict about explicitly naming all the expected attribute values, numbering the MetaName/Property ids, etc. But the idea is simple: a single XML schema.

    I have written the code to read header/config file format and create a swish_Config object. There's also code for merging 2 swish_Config objects together, so that you can define a config file to override an existing header file.
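
To make the override idea concrete, here is a small Python sketch of merging a config on top of a header. The XML element and attribute names (swish, MetaNames, id, bias) are invented for the example and should not be read as the actual swish.xml schema; parse_config and merge_configs are likewise hypothetical stand-ins for the swish_Config code.

```python
import xml.etree.ElementTree as ET

# Both documents use the same (made-up) schema: sections containing names.
HEADER = """<swish>
  <MetaNames>
    <title id="1" bias="0"/>
    <author id="2" bias="0"/>
  </MetaNames>
</swish>"""

OVERRIDE = """<swish>
  <MetaNames>
    <title bias="10"/>
    <topic/>
  </MetaNames>
</swish>"""


def parse_config(xml_text):
    """Parse a header/config document into {section: {name: attributes}}."""
    root = ET.fromstring(xml_text)
    return {
        section.tag: {item.tag: dict(item.attrib) for item in section}
        for section in root
    }


def merge_configs(base, override):
    """Return a new config in which values from override win over base."""
    merged = {
        section: {name: dict(attrs) for name, attrs in items.items()}
        for section, items in base.items()
    }
    for section, items in override.items():
        target = merged.setdefault(section, {})
        for name, attrs in items.items():
            target.setdefault(name, {}).update(attrs)
    return merged


merged = merge_configs(parse_config(HEADER), parse_config(OVERRIDE))
```

After the merge, title keeps its id from the header but takes its bias from the override, and topic is added as a new name. Neither input is modified.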

    Still TODO is the code for writing the header file.

    The Perl bindings have been updated to reflect the new swish_Config API. This required a great deal of reworking and thinking about the Perl API. I had to rewrite things a few times to get a workable solution. The key Perl mantra is "objects on demand." I.e., do not define any Perl objects that wrap C pointers and try to keep them on the XS side. Instead, create all Perl objects "just in time" as part of the get_* method call. This makes reference counting much simpler.


    MetaNames and PropertyNames
    These now have their own C API with swish_MetaName and swish_Property structs. These relate directly to the header file format and swish_Config. There will end up being a separate PropertyName API for search results. I still think we're going to have to port the Swish2 PropertyNames storage/retrieval code to Swish3 and have a backend-independent index.prop file. The issue with this is going to be scaling. One other thought I've had is storing properties in a SQLite db. That route won't allow for presorted properties, but does have the advantage of being much more transparent and de-buggable.
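
As a thought experiment, the SQLite route might look like this. It is a Python sketch with a table layout I invented for the example (one row per document per property), not a design Swish3 has committed to. Sorting happens at query time with ORDER BY, which is exactly the trade-off against presorted properties.

```python
import sqlite3


def open_property_store(path=":memory:"):
    """Open (or create) a property store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS props ("
        " doc_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (doc_id, name))"
    )
    return conn


def set_property(conn, doc_id, name, value):
    """Store or overwrite one property value for one document."""
    conn.execute(
        "INSERT OR REPLACE INTO props (doc_id, name, value) VALUES (?, ?, ?)",
        (doc_id, name, value),
    )


def sorted_doc_ids(conn, name, doc_ids):
    """Order a result set by a property, sorting at query time."""
    if not doc_ids:
        return []
    placeholders = ",".join("?" * len(doc_ids))
    rows = conn.execute(
        f"SELECT doc_id FROM props WHERE name = ? AND doc_id IN ({placeholders})"
        " ORDER BY value, doc_id",
        (name, *doc_ids),
    )
    return [row[0] for row in rows]
```

The transparency argument is that the resulting file can be opened with the sqlite3 command line tool and inspected with plain SQL.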


    SWISH::Prog
    I have moved the SWISH::Prog svn tree to svn.swish-e.org from peknet.com. I also moved SWISH::Filter (and likely will eventually move SWISH::API::More and its cousins).

    SWISH::Prog will form the framework for the Perl implementation of Swish3. I know there are some folks who don't like the idea of Swish3 being so Perl-centric. To that I can say only, tough luck. :)

    Seriously though, my perspective is that there will be multiple Swish3 implementations. The one I am working on is in Perl using SWISH::Prog. There's nothing to stop someone from implementing one in C or C++ or Java or whatever. libswish3 provides the parsing/tokenizing piece missing from other IR projects, and it is a library for the very reason that implementing a Swish3 program should be language-neutral. If you can link against a C library, then you can write a Swish3 program. The header file API is well documented; the backend is supposed to be pluggable. It's all about the API.

    I do intend to write a swish_xapian.cpp program eventually, showing how to implement a C++ Swish3 program with Xapian. That could be the fallback program if you really don't want to use Perl.


    Documentation
    I've started a swish_intro.7 and swish_migration.7 set of docs. swish_intro will outline the aggregator/normalizer/analyzer/indexer/searcher philosophy and the outline of the libswish3 API. swish_migration will discuss differences in Swish2 vs Swish3 and how you can convert your config files and move to using Swish3.

    File under projects/swish Sat Mar 15 22:23:14 CT 2008

    SWISH::Prog take 2

    Spent the last week or 2 totally reworking SWISH::Prog. Reorganized the class layout to mirror the aggregator/parser/indexer/searcher paradigm I described some time ago. It has started to look a little like KinoSearch in that respect, with the addition of the aggregators and parser (which is of course Swish-e's contribution to IR).
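
For readers who have not seen the paradigm spelled out, here is a toy version in Python. Every class name below is invented for the illustration and none of it is the SWISH::Prog API; the point is only that each stage has one job and hands its output to the next.

```python
import re
from collections import defaultdict


class ListAggregator:
    """Aggregator: yields (uri, raw text) pairs from some source."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        return iter(self.docs.items())


class WordParser:
    """Parser: turns raw text into a document with a word list."""

    def parse(self, uri, text):
        return {"uri": uri, "words": re.findall(r"\w+", text.lower())}


class MemoryIndexer:
    """Indexer: stores word -> set of uris."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc):
        for word in doc["words"]:
            self.postings[word].add(doc["uri"])


class MemorySearcher:
    """Searcher: looks up a single term in what the indexer stored."""

    def __init__(self, indexer):
        self.postings = indexer.postings

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


def build_index(aggregator, parser, indexer):
    """Run every aggregated document through the parser into the indexer."""
    for uri, text in aggregator:
        indexer.add(parser.parse(uri, text))
    return indexer
```

Swapping the in-memory indexer and searcher for a KinoSearch or Xapian backend would leave the aggregator and parser untouched, which is the attraction of the layout.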

    After mulling/experimenting for several days over how best to write the spider, I have decided to use WWW::Mechanize along with WWW::Rules and write from scratch. Then I'll provide backwards API compat for the Swish-e 2.4 spider.pl script config files/callbacks/etc. This proved easier than a direct port, and allows me to provide extensible caching/queueing/user_agent classes rather than hardcoding everything in a single script/library. I toyed with WWW::CheckSite but in order to make it work with the aggregator API required so many gymnastics it finally became easier to just write the spider myself. And a good programming exercise as well. :)

    File under projects/swish Mon Dec 10 22:29:46 CT 2007

    Xapian 1.0 Released

    Announced this morning.

    Now if only I could find time to finish the swishx Swish3 example program...

    File under projects/swish Fri May 18 09:00:48 CT 2007

    Open Source Search Tools

    I was answering an email tonight from the hyperestraier list about Xapian and Lucene and KinoSearch, and as I was googling around to find all the email threads I remembered being a part of on the topic, it was interesting to see intersections I hadn't remembered, like how the same people (like me) keep popping up around these tools.

    There are some folks who just need to implement a search engine for their website/company/intranet. These are the sysadmin types who just need something that works so that they can move on to the next project.

    Then there are folks working in the IR field itself who are trying to build the Next Big Search Thing, following in google tradition. Good luck to them. They'll need it.

    Then there are folks like me, who are a little OCD over things like IR and search. I consider the developers of the projects I list above in that camp. It's a good camp to be in.

    Open source search tools have come a long way and there is really some good momentum now in implementing multiple terabyte, high volume search projects using open source technology. I like working in IR at a time like this. Hopeful. Almost. :)

    File under projects/swish Fri May 11 20:56:58 CT 2007

    Swish3 Original Email

    Was looking back through my email related to Swish3 and found the original thread in which I describe the idea to the developer list.

    The historian in me thought it would be good to preserve that link somewhere. And I notice that original post was over 18 months ago. A (relatively) long time.

    File under projects/swish Sat Apr 14 20:36:52 CT 2007

    Tokenizing

    Marvin's got some good remarks on Perl's UTF-8 regexp vis-a-vis tokenizing strings. His remarks are timely, as I have been spending/wasting time lately in libswish3's C tokenizing functions. My goal was to replace them with Perl regexp matching, but that may have been premature given Marvin's remarks.

    File under projects/swish Thu Mar 15 21:18:11 CT 2007

    UTF-8 Research

    Multibyte encoding support is one of the big Swish3 features. I had what I thought was a workable framework for it using the C99 standard "wchar_t" wide character functions. But I've been disillusioned (which is usually a Good Thing) the last couple days based on some reading I've been doing on the linux-utf8 list archives.

    Namely this thread burst my bubble. But in a Good Way.

    Seems "wchar_t" is not portable. Particularly on Windows, where it is defined as 16 bits wide rather than the full 32 bits needed to hold any Unicode code point in a single unit.
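
The size problem is easy to demonstrate without touching C. This little Python sketch counts how many 16-bit and 32-bit units a single character needs; the helper names are mine. Anything outside the Basic Multilingual Plane takes two 16-bit units, so a 16-bit wchar_t cannot treat one array element as one character.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode one character."""
    return len(ch.encode("utf-16-le")) // 2


def utf32_units(ch):
    """Number of 32-bit code units needed to encode one character."""
    return len(ch.encode("utf-32-le")) // 4


E_ACUTE = "\u00e9"      # inside the Basic Multilingual Plane
G_CLEF = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside it
```

E_ACUTE fits in one unit either way, while G_CLEF needs a surrogate pair in 16-bit units and a single unit in 32-bit ones.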

    So I've been googling for other C libraries out there to help me. I need basically two kinds of functions:

    "utf8_tolower( xmlChar * mixedCaseStr )"
    All the metanames and propertynames need to be normalized against parsed tagnames. So need to have this. Also, all strings are normalized to lowercase before tokenizing. That's just an IR Good Practice.

    "tokenize_utf8_string( xmlChar * utf8_string )"
    Split up a string into *words*. This really should be language-aware, but at the very least it needs to recognize what's alpha vs whitespace vs punctuation, etc.

    Both types of functions are crucial to the kind of string wrangling Swish3 needs to do. Along the way, it would be nice to build in a portable UTF-8-aware regular expression library, since that would make for a nice flexible way of configuring WordTokenPattern rather than needing to write a whole C function to tokenize a string into words. This kind of regexp support is provided via general C regexp in Swish-e, which isn't UTF-8 aware I believe, or optionally via PCRE (Perl Compatible Regular Expressions) which (Dave tells me) isn't developed for Win32 anymore.
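
To pin down what I want from those two functions, here is their behavior sketched in Python, where the Unicode tables come for free. The function names mirror the C signatures above but this is an illustration of the desired behavior, not libswish3 code. Note that \w also matches digits and the underscore, and nothing here is language-aware.

```python
import re

# str patterns are Unicode-aware by default in Python 3, so \w matches
# letters and digits from any script, plus the underscore.
WORD = re.compile(r"\w+")


def utf8_tolower(raw):
    """Lowercase a UTF-8 encoded byte string, returning UTF-8 bytes."""
    return raw.decode("utf-8").lower().encode("utf-8")


def tokenize_utf8_string(raw):
    """Split a UTF-8 byte string into lowercased words."""
    return WORD.findall(utf8_tolower(raw).decode("utf-8"))
```

Making the pattern a parameter instead of a constant would give the regexp-driven WordTokenPattern configuration for almost no extra code.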

    So it would be nice, though not crucial, to get UTF-8-happy regex support if I can as well.

    Google says many things about UTF-8 and i18n, but what it doesn't tell me is what I should do. :)

    Here are some things I've found:

    UTF-8 regexp library. Actually supports lots of different encodings, which I don't really need.

    The ICU from IBM. Big blue and formidable. Way More than I need. Not going to use this one.

    Makes "wchar_t" portable. Kind of. But the author is the one who disillusioned me in that mail thread I mention above. So I'm not going down that road anymore.

    This library doesn't appear to be supported any more.

    Could be a starting point if I need to roll my own. Does UTF-8 to UCS-4 conversion, so doesn't depend on "wchar_t". Has some UTF-8 functions, including a tolower().

    As Bill reminded me, libxml2 has unicode stuff in it too. Not exactly what I need, but could be a starting place. And it fits with my increasing sense of using libxml2 to do everything.

    What's in a Word?
    Word tokenization is the big issue. Swish-e tokenizes through a 256-byte lookup table: it's perfect for 8bit encodings because it is fast and easy to understand. But the Hawker Observation applies: you need to seriously rethink the algorithm you're using every time you increase your data set by two orders of magnitude.

    That's why a regexp library or something else with predefined Unicode character tables is necessary. It's a wheel that's been invented. What I'm struggling with tonight is *which* wheel to use: what's easy to implement, well supported and proven, and will be something I can use now and trust a year from now.
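
For the record, this is roughly how a 256-entry lookup table tokenizer works, and why it falls over on multibyte text. It is a Python sketch of the technique, not the Swish-e source, and the set of word characters is just lowercase ASCII letters and digits for the example.

```python
def make_table(word_chars):
    """Build a 256-entry table: 1 if the byte is a word character, else 0."""
    table = bytearray(256)
    for byte in word_chars:
        table[byte] = 1
    return bytes(table)


ASCII_WORD_CHARS = b"abcdefghijklmnopqrstuvwxyz0123456789"
TABLE = make_table(ASCII_WORD_CHARS)


def tokenize_bytes(data, table=TABLE):
    """Split bytes into words with one table lookup per byte."""
    words, current = [], bytearray()
    for byte in data.lower():  # bytes.lower() only touches ASCII letters
        if table[byte]:
            current.append(byte)
        elif current:
            words.append(bytes(current))
            current = bytearray()
    if current:
        words.append(bytes(current))
    return words
```

Feed it UTF-8 and every byte of a multibyte character is looked up on its own, so with this table "café" comes out as "caf". Marking the high bytes as word characters avoids the split but then the table can no longer tell letters from punctuation outside ASCII.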

    File under projects/swish Wed Mar 14 09:58:29 CT 2007

    libswish3

    I have moved my last 2 years' work over to the SVN repos at swish-e.org. Pretty significant for me, making the whole thing public now.

    At this point, I'm working under the following assumptions:

    Tokenizing
    Swish3 needs a UTF-8 aware tokenizer. This really means either a regexp library or a full UTF-8 character library. A regexp library is far easier with regard to user configuration, easier even than Swish-e's *Characters directives.

    I have looked at standalone C libraries to accomplish this. The best bet standalone is PCRE but building that for Windows is a little tricky. Swish-e offers optional PCRE support, so there is a precedent.

    For the time being, however, I am going to build using Perl, which already has full UTF-8 regexp support. Why be Compatible when you can have the Real Thing?

    Mostly I am using Perl because (a) I know it, (b) it's ideal for things like search and indexing, and (c) I want to include a backend for KinoSearch.

    There is nothing in the current libswish3 implementation that precludes using a different tokenizing scheme in the future. For my purposes, Perl is just fine.

    Swish3
    I have set up Perl bindings for libswish3's configuration and parser functions in the SWISH::3 namespace. The plan is to write a Perl OO library to complement/wrap libswish3's C functions. That Perl library can in turn be used to write applications. I'll be implementing a "swish3" command line app in Perl using SWISH::Prog and distributing it as part of SWISH::Prog.

    TODO
    There's still a long todo list. But there is progress, and the shape of what is to come.

    File under projects/swish Sun Mar 4 21:14:21 CT 2007

    Swish3 Documentation

    I have posted the current working draft of the Swish3 documentation. These are mostly API docs for the C library.

    These APIs are subject to change. See the Swish3 dev site for up-to-date info and source.

    File under projects/swish Wed Feb 28 11:04:37 CT 2007

    CPAN modules updated

    Just uploaded SWISH-Prog 0.03 and Search-Tools 0.02 to the CPAN. They join SWISH-API-More, SWISH-API-Stat and SWISH-API-Object on the CPAN this week.

    Whew.

    File under projects/swish Fri Oct 6 19:48:52 CT 2006

    Bindings

    One of the things I like most about using other IR libraries as backends is that many of them offer language bindings in multiple other languages. So PHP, Ruby, Python, etc., users can be happy right away with search ability.

    Of course, if the indexing program is Perl instead of C, there is that added requirement. But hopefully, if the indexing API is well documented, there's nothing to stop implementations in other scripting languages besides Perl.

    Take Xapian for example. They have bindings available in nearly every major scripting language. So if you don't like the way Swish-e implements the indexing scheme, there's nothing to stop you from writing your own in your favorite language. At which point, you're not really using Swish-e any more. But you could mix/match depending on your needs. Use Swish-e's spider and SWISH::Filter, but your own parser and indexer, for example.

    File under projects/swish Wed Oct 4 14:17:18 CT 2006

    SWISH::Prog

    The general idea right now is to get the core C libraries functional, at least for SwishParser and SwishConfig. Then start working on the "swish-e" command line program replacement. I intend to write the replacement in Perl, since that will be much easier to write and performance should only see a small hit from startup costs. I'll use SWISH::Prog to handle the basic spider/fs stuff, as well as config parsing.

    Funny: I don't think I had that in mind when I originally started SWISH::Prog but it now seems like a totally obvious fit.

    SWISH::Prog::Config just underwent some major surgery. It can now parse version2 config files using the excellent Config::General, and can convert to the current SwishConfig XML format.
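
The conversion itself is conceptually simple. Here is a stripped-down Python sketch of the idea: read "Directive value value" lines and emit XML. It stands in for what Config::General does in the real module and ignores quoting, line continuations and everything else a real Swish-e config can contain. The output element names (swish, value) are invented and are not the SwishConfig XML format.

```python
import xml.etree.ElementTree as ET


def parse_v2_config(text):
    """Parse 'Directive value ...' lines into {directive: [values]}."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        values = parts[1].split() if len(parts) > 1 else []
        # Repeated directives accumulate rather than overwrite.
        config.setdefault(parts[0], []).extend(values)
    return config


def to_xml(config, root_tag="swish"):
    """Serialize the parsed config as XML (element names are hypothetical)."""
    root = ET.Element(root_tag)
    for directive, values in config.items():
        section = ET.SubElement(root, directive)
        for value in values:
            ET.SubElement(section, "value").text = value
    return ET.tostring(root, encoding="unicode")
```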

    I'll probably start with a Xapian backend since that's fairly stable (though UTF-8 support is still not official till 1.0). Need to write SWISH::Index and SWISH::Search APIs (though the latter will likely look just like SWISH::API).

    Everything in due time.

    File under projects/swish Wed Oct 4 10:53:24 CT 2006

    10 Guidelines for Swish-e 3.0 Development

    Thoughts on the Swish-e project (http://swish-e.org/).

    Understand why folks like Swish-e 2
    Fast. Easy to configure. Flexible. Keep it that way.

    Understand why folks crave Swish-e 3
    Folks like Fast, Easy and Flexible. They want to bring those qualities to bear on more difficult challenges.

    * I18n demands multi-byte charset support. UTF-8 is the accepted standard. Swish-e 2 is stuck with single-byte charsets.

    * Data sets are huge these days. Swish-e 2 doesn't scale well past a few million documents.

    * Huge data sets mean lots of time spent indexing. Stable incremental index support (add, update, delete) is a must. Swish-e 2 has incremental support but it is buggy and the code is opaque.

    * It's a polylingual world. Swish-e lacks modern script language bindings beyond Perl.
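
To show what "incremental" means in the list above, here is a toy inverted index in Python with add, update and delete. It is an illustration of the three operations only; IncrementalIndex is a made-up class and real backends do this on disk with far more care.

```python
from collections import defaultdict


class IncrementalIndex:
    """A toy inverted index that supports add, update and delete."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.terms_by_doc = {}            # doc id -> its terms, needed for delete

    def add(self, doc_id, text):
        if doc_id in self.terms_by_doc:
            raise KeyError(f"already indexed: {doc_id}")
        terms = set(text.lower().split())
        self.terms_by_doc[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        # Only the postings for this document's terms are touched.
        for term in self.terms_by_doc.pop(doc_id):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        if doc_id in self.terms_by_doc:
            self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))
```

The key point is that none of the three operations rebuilds the whole index, which is what makes a huge document set bearable.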

    C == Fast but C == Slow
    C (like other compiled languages) provides the best speed. The core code base for Swish-e is all in C. C is the *lingua franca* of the open source world.

    But C is harder to write than most scripting languages, and takes much longer to develop and debug (and thus maintain). So projects that involve C attract fewer developers from the community. Fewer developers means (on the whole) that development time is slower. A couple good C developers can turn out good code quickly, but as with all OSS, maintenance and legacy become big issues -- community (people) issues that become software issues. See the maintenance issue below.

    Modular == cool. Monolithic != cool.
    Swish-e 2 revolves around the swish-e command line tool, a monolithic tool that parses, indexes and searches. A good step has been taken with libswish-e for splitting out search into its own library. Let's continue that direction by splitting up the parser, indexer and searcher into separate, modular components (libraries). That increases Flexibility (while probably impacting Fast).

    Maintainable code is a feature
    What happens when you get hit by a bus? Or get bored? Or move on? Or change careers and take up that basket weaving profession you've always secretly craved? Who will maintain your code? Did you document what you wrote? Did you check in your latest changes? Are your comments clear? Don't fall into the trap that the code is the documentation. Swish-e is more than the development team *du jour*; folks will keep using it after you're gone. OSS projects all suffer from this problem: check out the orphaned projects on sf.net.

    Community is a feature
    Getting folks involved is one of the joys (and struggles) of OSS projects. Making it easy for folks to get involved, whether contributing documentation, tests, patches or good beer, is a good way to keep the fun in your own involvement.

    Since writing C is a skill that fewer people have, encourage folks to write tests (using the TAP format), documentation, and how-tos.

    And consider how much code needs to actually be written in C. Much of the strong parts of Swish-e are actually Perl scripts that support and supplement the core C program.

    Don't reinvent the wheel
    There are lots of search tools out there. Information retrieval is a hot subject right now. Swish-e 2 has some cool features. It's Fast, Easy and Flexible. But it doesn't do everything folks want it to.

    However, other projects are strong where Swish-e is weak. There is quality open source IR code out there that does UTF-8, incremental indexing and good scaling. Good programmers are lazy. Let's use other folks' code to get the features we want.

    Play to your strengths
    Those other IR projects might be weak where Swish-e 2 is strong. They might be Slow, Hard or Inflexible in key areas. Let's figure out what makes Swish-e Fast, Easy and Flexible and concentrate on making those parts of the code easy to integrate with the quality pieces from those other IR projects. Remember: modular is cool.

    This is supposed to be Fun
    Remember?

    File under projects/swish Fri Sep 29 22:16:07 CT 2006

    Swish3 Proposal

    Thoughts on Swish-e version 3.

    Assumptions
    * In order to keep Swish-e fast and portable, some key parts need to be written in a compiled language like C.

    * C developers are increasingly hard to recruit to OSS projects like Swish-e.

    * C is slower to develop and more difficult to maintain than non-compiled languages like Python or Perl.

    * To encourage more code contributors to the project and make the project more useful to more people, make the core C parts library modules with well-defined and documented APIs. This makes the code more maintainable and flexible, and allows integration of other IR libraries like Xapian.

    Core C Libraries
    *NOTE The following list is no longer accurate. libswish3 combines all these into one library.*

    SwishUtils (libswishu)
    Common shared functions for things like IO, string handling, times, errors, memory and hashing.

    I've started this one.

    SwishConfig (libswishc)
    Parse config files into in-memory data structures, and read/write index config headers.

    I've started this one.

    SwishParser (libswishp)
    Parse documents into properties and wordlist.

    I've started this one.

    SwishIndex (libswishi)
    Store properties and wordlists.

    TODO.

    SwishSearch (libswishs)
    Parse queries and fetch results from an index.

    Could be re-working of existing libswish-e to expect UTF-8 (which SwishUtils supports).

    File under projects/swish Fri Sep 29 22:15:53 CT 2006

