Frozen Perl 2010
It's been a long week, culminating today in
Frozen Perl 2010, a Perl conference for and by Perl hackers, here in the Twin Cities. I gave two talks at today's conference,
one on
Swish3 and
the other on
Devel::NYTProf and
Search::Tools. Both talks seemed well-received.
In the process of preparing the talks I also released a few new, related
modules to CPAN this week:
- Search::OpenSearch
- OpenSearch server glue for KinoSearch
and Swish-e 2.x via
SWISH::Prog. There's a
demo app built with Plack and ExtJS that uses both search engines, included in the slides for my Swish3 talk.
I think OpenSearch is very cool and look forward to doing more with that spec,
including adding more features (e.g. facets) to Search::OpenSearch.
- Search::Query
- Search::Query now has support for SQL and SWISH Dialects. I hope to add
KinoSearch and Xapian dialects soon. The Search::Query::Parser now has
(undocumented and experimental) support for range queries, so that you can say:
foo=( 1..4 )
and that'll be expanded to
foo=( 1 OR 2 OR 3 OR 4 )
when the Dialect query object is stringified. Handy for things like ranges of
dates, which is how I am using it at $work.
- Search::Tools, SWISH::API::*
- New releases of these older modules as well, with some bug fixes and
refactoring to support the Search::Query.
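For the curious, the range expansion mentioned above for Search::Query can be sketched in a few lines of Python. This is not the Search::Query::Parser code, only an illustration of the idea: the function name expand_ranges and the regex are made up for the example, and it assumes integer endpoints only.

```python
import re

# Matches an integer range such as "1..4" (illustrative syntax only).
RANGE_RE = re.compile(r"(-?\d+)\.\.(-?\d+)")


def expand_ranges(query: str) -> str:
    """Replace each "lo..hi" range in the query with "lo OR ... OR hi"."""

    def _expand(match):
        lo, hi = int(match.group(1)), int(match.group(2))
        if lo > hi:
            raise ValueError(f"empty range: {match.group(0)}")
        return " OR ".join(str(n) for n in range(lo, hi + 1))

    return RANGE_RE.sub(_expand, query)


print(expand_ranges("foo=( 1..4 )"))  # foo=( 1 OR 2 OR 3 OR 4 )
```

Dates written as integers (say, 20100201..20100203) would expand the same way, which is roughly the use case described above.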
So, yes. A busy week.
I enjoyed hearing other folks' talks today at Frozen Perl. There was a good
variety: pack/unpack, Unicode, i18n and best practice-related presentations. I
met some new people, renewed friendships with folks I already knew, and drank
lots of free coffee. The cookies were good too.
File under projects/swish
Sat Feb 6 23:27:19 CT 2010
SWISH::Prog::KSx and SWISH::Prog::Xapian on CPAN
Uploaded first pass at both implementations this last week.
The
announcement to the Swish-e list
just went out.
File under projects/swish
Mon Nov 30 22:19:26 CT 2009
SWISH::3 on CPAN
After 4 years of learning how to glue Perl and C together with XS and many sleepless nights,
I have released SWISH::3 to the CPAN.
<cue the sound of scattered applause>
Mostly this is a triumph of longevity rather than quality code. It's taken me this long to get
something workable.
File under projects/swish
Fri Nov 20 22:02:44 CT 2009
swish_xapian
The
Xapian backend for Swish3 has been getting some love lately.
The
swish_xapian command line tool has most of the features now that swish-e v2.x does.
I've posted about it on the
Swish-e wiki.
File under projects/swish
Wed Nov 18 23:22:11 CT 2009
Building Swish3 on OS X 10.6
I just wasted many hours trying to figure out why libswish3 failed to pass all tests on 10.6.
This link explains what I figured out to be true the hard way:
10.6 is now a mainsream 64bit OS !
10.5 a 64bit capable 32bit OS !
If I force a 32-bit compile, all is well:
CFLAGS="-m32 -O2 -g" ./configure && make test
While I would like to figure out how to compile as a native 64bit app, my
MacBook has too many libs from before the 10.5 to 10.6 upgrade to trust that all the dep
chain is 64bit compat.
The error I was seeing was the noxious BAD_ADDRESS error which traced back to
some libxml2 hash features. Red herring. Of course, I had to recompile libxml2
with the -m32 as well so that everything was 32bit compatible. Took me hours
before I noticed that the older working version on the same box was about half
the size of the new version... which triggered the ol' 32-vs-64-bit thing in
my brain.
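A quicker way to spot a 32-vs-64-bit mismatch than comparing file sizes is to look at the first four bytes of the library. Here is a small Python sketch that does that for Mach-O files; the function names are made up for the example, and in practice the file command reports the same information.

```python
import struct

# Mach-O magic numbers, read as a big-endian 32-bit integer from the first
# four bytes of the file. The byte-swapped forms cover little-endian files.
MAGIC = {
    0xFEEDFACE: "32-bit",
    0xCEFAEDFE: "32-bit",
    0xFEEDFACF: "64-bit",
    0xCFFAEDFE: "64-bit",
    0xCAFEBABE: "universal (fat)",
    0xBEBAFECA: "universal (fat)",
}


def macho_kind(header: bytes) -> str:
    """Classify a Mach-O file from its first four bytes."""
    if len(header) < 4:
        return "unknown"
    (magic,) = struct.unpack(">I", header[:4])
    return MAGIC.get(magic, "unknown")


def macho_kind_of_file(path: str) -> str:
    """Read the first four bytes of the file at path and classify them."""
    with open(path, "rb") as fh:
        return macho_kind(fh.read(4))


# An x86_64 Mach-O file starts with the bytes cf fa ed fe.
print(macho_kind(bytes.fromhex("cffaedfe")))  # 64-bit
```

Running macho_kind_of_file over every library in the dependency chain would show at a glance which ones were built for which architecture.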
Update: In the end this was a bug in libswish3 with confusing naming of
some variables. But the 64-bit thing was a Good Thing To Realize.
File under projects/swish
Mon Oct 5 23:05:50 CT 2009
Roy Tennant Interview
Roy resurrected the Swish-e project nearly 10 years ago.
There's a
nice
interview with him out recently, in which he says "there really isn't anything I can't do with Perl and my favorite indexing tool, Swish-e" -- which is exactly my experience too.
Thanks, Roy.
File under projects/swish
Sat Mar 28 21:14:16 CT 2009
Xapian, KinoSearch
I'm starting in (again) on KS and Xapian backends for SWISH::Prog, using libswish3.
I just downloaded the latest Xapian release (1.0.10) and latest KS from svn.
I started the build process for Xapian first, then svn up'd KS and started
the KS build. The KS build finished, and all tests passed, before Xapian
even finished its build. That could be due to the speed of the gcc vs g++ compilers.
Dunno. It was just a startling thing to notice.
Now back to writing code...
File under projects/swish
Tue Dec 30 20:15:45 CT 2008
Swish3 Status 19 Sept 2008
A long hiatus for a full summer and then some contract work.
Some benchmarks of
the
latest tokenization algorithm show that pure-ASCII tokenization
is about 20% faster and UTF-8 tokenization is about 2% slower. So I'll
take it.
The benchmark used perl/docmaker.pl to generate 100 random
"docs" in both encodings (ASCII and random UTF-8), which were then
timed using swish_lint with and without the -t option.
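The shape of that benchmark can be sketched in Python. This is only an analogy, not docmaker.pl or swish_lint: the alphabets, document sizes, and the regex tokenizer are all assumptions made for the example, and it works on Python strings rather than raw UTF-8 bytes.

```python
import random
import re
import time

WORD_RE = re.compile(r"\w+")  # Unicode-aware for str patterns in Python 3

ASCII_ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# A few non-ASCII letters so the second set contains multi-byte characters.
UTF8_ALPHABET = ASCII_ALPHABET + "àéîõüßçñøå"


def make_docs(alphabet, n_docs=100, words_per_doc=200, seed=42):
    """Generate n_docs random space-separated "docs" from the alphabet."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        words = (
            "".join(rng.choice(alphabet) for _ in range(rng.randint(2, 10)))
            for _ in range(words_per_doc)
        )
        docs.append(" ".join(words))
    return docs


def time_tokenizing(docs):
    """Return (token count, elapsed seconds) for tokenizing all docs."""
    start = time.perf_counter()
    n_tokens = sum(len(WORD_RE.findall(doc)) for doc in docs)
    return n_tokens, time.perf_counter() - start


for label, alphabet in (("ascii", ASCII_ALPHABET), ("utf-8", UTF8_ALPHABET)):
    n_tokens, elapsed = time_tokenizing(make_docs(alphabet))
    print(f"{label}: {n_tokens} tokens in {elapsed:.4f}s")
```

The fixed seed keeps the generated docs identical between runs, so only the tokenizer under test changes from one timing to the next.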
Also recently fixed some failing tests on Linux and a memory warning.
File under projects/swish
Fri Sep 19 05:33:31 CT 2008
Tests
Swish3 has a lot more tests than Swish-e (not counting Josh Rabinowitz's excellent testing package).
And as of tonight all Swish3 tests are passing under both Linux 2.6 (CentOS 5) and OS X 10.4.
\o/
File under projects/swish
Thu Sep 18 22:54:29 CT 2008
Years pass, code grows, expectations ... ?
It's the OSCON time of year again. A couple of years ago I was in a hotel in Portland OR,
coding up a KinoSearch implementation of Swish3. That was then. Two years later, Swish3
is really no further on, except that it has been largely refactored and is a much more
stable, yet alpha, project.
Just remembering that tonight as I update the Perl bindings of Swish3 to use the new
tokenization routines. It's the kind of project that moves in fits and starts. I would have
liked to have finished it years ago, but it would not be as good, since many of the refactorings
I have made over the last couple of years have been the direct result of coding strategies I
have learned in my $job(s) and other FOSS projects.
And I find I like my life this way, chipping away at building a better mousetrap while
I sip New Belgium beer and listen to Robert Plant and Alison Krauss. Sure, it's a sultry July
evening, and my entire life occasionally floats through my mind like a grainy day at the beach,
but hey: you only get this chance once, and building a better mousetrap is not a bad way to pass
some idle moments.
File under projects/swish
Thu Jul 24 21:46:09 CT 2008
Swish3 Status 19 April 2008
There's been quite a bit of activity in the last month.
- The C++ Xapian example now can search as well as index, and there are Perl
equivalents using Search::Xapian checked into svn as well. The C++ code
will read/write the swish.xml header; the Perl does not (yet).
-
The meta/prop id unique check now uses a hash for quick look up.
-
The test suite for libswish3 is totally restructured. Now using Perl's
Test::Harness and added a slew of new meta/prop tests. Alongside that
were additions to the NamedBuffer debugging output to print each
substring in the buffer.
-
Several new string-related utility functions for converting ints to strings
and back. Also a new config hash for configuration options that use a StringList
instead of a simple string.
- Fixed some mem leaks in the example .c programs and added more info to the
swish_lint usage() output (including reminders about the various SWISH_DEBUG*
env var values).
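The hash-based uniqueness check for meta/prop ids mentioned above boils down to something like the following Python sketch. It is not the libswish3 C code; the class and method names are invented for the example.

```python
class DuplicateError(ValueError):
    """Raised when a name or id has already been registered."""


class NameRegistry:
    """Tracks names and ids, rejecting duplicates via hash lookups."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = {}

    def add(self, name, ident):
        # Both checks are constant-time lookups instead of list scans.
        if name in self._id_by_name:
            raise DuplicateError(f"name already registered: {name}")
        if ident in self._name_by_id:
            raise DuplicateError(f"id already registered: {ident}")
        self._id_by_name[name] = ident
        self._name_by_id[ident] = name


registry = NameRegistry()
registry.add("title", 0)
registry.add("author", 1)
try:
    registry.add("subject", 1)  # id 1 is already taken
except DuplicateError as err:
    print(err)  # id already registered: 1
```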
There are still several parser features yet to be implemented to support the Swish-e
2.4 config options, but those will likely take a backseat to getting a working
swish3 Perl script running with SWISH::Prog and SWISH::Prog::Xapian.
File under projects/swish
Sat Apr 19 22:42:14 CT 2008
Swish3 Status 30 March 2008
More progress with Swish3.
-
There is now a swish_xapian.cpp C++ example for using
libswish3 with a Xapian backend. All that is complete is the indexing
portion; still TODO is the search part. Still, a significant thing that
it was so easy to build a search engine.
-
The swish.xml header format is complete, and libswish3 can now read/write the header file.
Need to add that part to the swish_xapian.cpp example.
-
Squashed some long-standing memory leaks when using the filehandle functions.
Little by little.
File under projects/swish
Sun Mar 30 22:03:48 CT 2008
Swish3 Status 2008-03-15
I've finally gotten back to Swish3 development after several months away. Hard to believe
I've been working on this project for something close to 3 years now.
Lately I have been focusing on the following things:
- Header file format
-
Because Swish3 will have multiple IR backends, it is important that there
be a consistent index metadata file that describes the MetaNames, Properties,
and tokenizing information, just like the Swish2 header does. Just as with the config
file format, it makes sense to define the header file format as XML, since we
already have a robust XML parser for free. To make it simple, I have defined
the header file XML schema to be the same as the config file schema. In short,
you configure Swish3 by creating a header file. The "real" header file will
be more strict about explicitly naming all the expected attribute values,
numbering the MetaName/Property ids, etc. But the idea is simple: a single
XML schema.
I have written the code to read header/config file format and create a
swish_Config object. There's also code for merging 2 swish_Config objects together,
so that you can define a config file to override an existing header file.
Still TODO is the code for writing the header file.
The Perl bindings have been updated to reflect the new swish_Config API. This
required a great deal of reworking and thinking about the Perl API. I had to rewrite
things a few times to get a workable solution. The key Perl mantra is "objects on demand."
I.e., do not define any Perl objects that wrap C pointers and try to keep them on the XS
side. Instead, create all Perl objects "just in time" as part of the get_* method call.
This makes reference counting much simpler.
- MetaNames and PropertyNames
- These now have their own C API with swish_MetaName and swish_Property structs.
These relate directly to the header file format and swish_Config. There will end up
being a separate PropertyName API for search results. I still think we're going to have
to port the Swish2 PropertyNames storage/retrieval code to Swish3 and have a backend-independent
index.prop file. The issue with this is going to be scaling. One other thought I've had
is storing properties in a SQLite db. That route won't allow for presorted properties,
but does have the advantage of being much more transparent and de-buggable.
- SWISH::Prog
- I have moved the SWISH::Prog svn tree to svn.swish-e.org from peknet.com. I also
moved SWISH::Filter (and likely will eventually move SWISH::API::More and its cousins).
SWISH::Prog will form the framework for the Perl implementation of Swish3. I know there
are some folks who don't like the idea of Swish3 being so Perl-centric. To that I can say only,
tough luck. :)
Seriously though, my perspective is that there will be multiple Swish3 implementations.
The one I am working on is in Perl using SWISH::Prog. There's nothing to stop someone from
implementing one in C or C++ or Java or whatever. libswish3 provides the parsing/tokenizing
piece missing from other IR projects, and it is a library for the very reason that implementing
a Swish3 program should be language-neutral. If you can link against a C library, then you can
write a Swish3 program. The header file API is well documented; the backend is supposed to be
pluggable. It's all about the API.
I do intend to write a swish_xapian.cpp program eventually, showing how to implement a C++
Swish3 program with Xapian. That could be the fallback program if you really don't want to use
Perl.
- Documentation
- I've started a swish_intro.7 and swish_migration.7 set of docs. swish_intro will outline the
aggregator/normalizer/analyzer/indexer/searcher philosophy and the outline of the libswish3
API. swish_migration will discuss differences in Swish2 vs Swish3 and how you can convert your
config files and move to using Swish3.
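The idea above of merging two swish_Config objects, so that a config file overrides an existing header file, can be illustrated with nested dictionaries in Python. This is a sketch of the concept only, not the libswish3 API: merge_config is a made-up name, and the keys and values below are placeholders rather than the real header schema.

```python
import copy


def merge_config(header, override):
    """Return a new dict: header values, recursively overridden by override."""
    merged = copy.deepcopy(header)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and new sections from the override win outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder data standing in for a parsed header file and config file.
header = {
    "MetaNames": {
        "swishdefault": {"id": 0, "bias": 0},
        "title": {"id": 1, "bias": 0},
    },
    "Properties": {"swishtitle": {"id": 0}},
}
override = {
    "MetaNames": {"title": {"bias": 10}, "author": {"id": 2, "bias": 0}},
}

merged = merge_config(header, override)
print(merged["MetaNames"]["title"])  # {'id': 1, 'bias': 10}
```

The merge returns a new object and leaves both inputs untouched, so the original header stays available for comparison.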
File under projects/swish
Sat Mar 15 22:23:14 CT 2008
Xapian 1.0 Released
Announced this morning.
Now if only I could find time to finish the swishx Swish3 example program...
File under projects/swish
Fri May 18 09:00:48 CT 2007
Open Source Search Tools
I was answering an email tonight from the hyperestraier list about
Xapian and Lucene and KinoSearch, and as I was googling around to find
all the email threads I remembered being a part of on the topic,
it was interesting to see intersections I hadn't remembered, like how
the same people (like me) keep popping up around these tools.
There are some folks who just need to implement a search engine for their
website/company/intranet. These are the sysadmin types who just need
something that works so that they can move on to the next project.
Then there are folks working in the IR field itself who are trying to
build the Next Big Search Thing, following in the Google tradition. Good luck
to them. They'll need it.
Then there are folks like me, who are a little OCD over things like
IR and search. I consider the developers of the projects I list above
in that camp. It's a good camp to be in.
Open source search tools have come a long way and there is really some
good momentum now in implementing multiple terabyte, high volume search
projects using open source technology. I like working in IR at a time
like this. Hopeful. Almost. :)
File under projects/swish
Fri May 11 20:56:58 CT 2007
Swish3 Original Email
Was looking back through my email related to Swish3 and found the
original thread in which I describe the idea to the developer list.
The historian in me thought it would be good to preserve that link somewhere. And I notice that the
original post was over 18 months ago. A (relatively) long time.
File under projects/swish
Sat Apr 14 20:36:52 CT 2007
Tokenizing
Marvin's got some
good
remarks on Perl's UTF-8 regexp vis-a-vis tokenizing strings. His remarks
are timely,
as I have been spending/wasting time lately in libswish3's C tokenizing functions. My goal was to replace
them with Perl regexp matching, but that may have been premature given
Marvin's remarks.
File under projects/swish
Thu Mar 15 21:18:11 CT 2007