libswish3 is at the core of multiple Swish3 implementations, and has reached a
stable enough API that a 1.0.0 release seems appropriate.
From the README:
libswish3 is a document parser compatible with the Swish-e 2.4 -S prog API.
libswish3 is a C library for parsing documents into a data structure that can
then be stored and searched with a variety of IR backends.
There are currently four implementations of Swish3 available:
swish_xapian (C++ using libxapian, included in libswish3 distribution)
SWISH::Prog::Xapian (Perl using Search::Xapian)
SWISH::Prog::Lucy (Perl using Apache Lucy)
SWISH::Prog::KSx (Perl using KinoSearch)
All the Perl implementations are available from CPAN.
They each rely on SWISH::3 (the Perl bindings to libswish3) and the core SWISH::Prog project, a Perl
rewrite of the swish-e 2.x C binary and accompanying helper scripts. The
SWISH::Prog distribution includes a 'swish3' command line interface with options
very similar to the swish-e 2.x command line tool.
Xapian, KinoSearch and Apache Lucy all offer robust UTF-8 and incremental
indexing support, as well as the ability to scale to many millions of documents
across multiple servers.
One of the three virtues of programming is Laziness. Beware
of false laziness. Andy Lester describes
the problem aptly in recounting an interaction with another programmer:
This person was one of those programmers who tried for the premature optimization of saving some typing. He forgot that typing is the least of our concerns when programming. He forgot that programmer thinking time costs many orders of magnitude more than programmer typing time, and that the time spent debugging can dwarf the amount of time spent creating code.
I can vouch for the writer's experience, though for me it has been less about back pain
(though I have that too) than eye strain (going on 7 years now). Biggest of all though
has been having children and working from home: that is the interruption formula in a nutshell.
Update: finally found a fix for this. The problem is that Perl has its own
my_setenv() function that interferes with the native setenv() called by
libswish3.c. The fix was to set the magic Perl var PL_use_safe_putenv as
shown here.
This took many hours of googling to track down. Glad to be done with it (I
hope!).
There's been a ton of work on Swish3 in the last year. I've actually started planning a 1.0 release,
after 5 years of work.
Lately I've been focusing on three things: (1) making the Perl bindings easier to install; (2) indexing
of compressed documents; and (3) supporting XInclude of document fragments. The first is accomplished: you
can install the entire library via CPAN. The last two are aimed at large
doc sets where I want to keep the XML compressed on disk for space reasons, and where I want to re-use
subsets of the document collections in building multiple indexes.
in a project and watching as 1000s of successful tests scroll
by, culminating in the
All tests successful.
message, gives me the same thrill
of satisfaction as when I used to paint houses, and having finished a long day of sweaty labor
at sanding and chipping old paint off, I could stand back and survey the structure,
primed and ready for a fresh coat of paint. It's the anticipation that thrills, in the
same way that a trip to the grocery store and a full fridge, or several loads of clean
laundry folded and stowed safely away in drawers, thrills me. The knowing that I am prepared,
belt cinched tight, all tests successful.
It's been a long week, culminating today in Frozen Perl 2010, a Perl conference for and by Perl hackers, here in the Twin Cities. I gave two talks at today's conference,
one on Swish3 and
the other on Devel::NYTProf and
Search::Tools. Both talks seemed well-received.
In the process of preparing the talks I also released a few new, related
modules to CPAN this week:
Search::Query now has support for SQL and SWISH Dialects. I hope to add
KinoSearch and Xapian dialects soon. The Search::Query::Parser now has
(undocumented and experimental) support for range queries, so that you can say:
foo=( 1..4 )
and that'll be expanded to
foo=( 1 OR 2 OR 3 OR 4 )
when the Dialect query object is stringified. Handy for things like ranges of
dates, which is how I am using it at $work.
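The range expansion described above can be sketched in a few lines. This is only an illustration in Python, not the Search::Query::Parser implementation: the `expand_ranges` helper and its regular expression are hypothetical, and it assumes ranges are integer-only and written as `( low..high )`.

```python
import re

# Hypothetical pattern: matches "( 1..4 )" style integer ranges only.
RANGE = re.compile(r"\(\s*(-?\d+)\.\.(-?\d+)\s*\)")


def expand_ranges(query: str) -> str:
    """Rewrite each '( a..b )' clause as '( a OR a+1 OR ... OR b )'."""

    def _expand(match: re.Match) -> str:
        low, high = int(match.group(1)), int(match.group(2))
        if low > high:
            raise ValueError(f"empty range {low}..{high}")
        terms = " OR ".join(str(n) for n in range(low, high + 1))
        return f"( {terms} )"

    return RANGE.sub(_expand, query)


print(expand_ranges("foo=( 1..4 )"))  # foo=( 1 OR 2 OR 3 OR 4 )
```

Expanding at stringification time keeps the parsed query compact, at the cost of very long output for wide ranges.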
Search::Tools, SWISH::API::*
New releases of these older modules as well, with some bug fixes and
refactoring to support the Search::Query.
So, yes. A busy week.
I enjoyed hearing other folks' talks today at Frozen Perl. There was a good
variety: pack/unpack, Unicode, i18n and best practice-related presentations. I
met some new people, renewed friendships with folks I already knew, and drank
lots of free coffee. The cookies were good too.
So I don't surf youtube very much. Or rather, only when my kids are wanting to watch
Wallace and Gromit trailers. So I'm always waaaay behind the times. That said, this video
is a riot.
For the last ten years I have used the color #E3BF70#fddc8e (hex) as my terminal background color. It's a darkish amber color
that is very easy on the eyes. I'm recording it here because every year or so I have to set up a new system
and always have to eyeball the settings till I get something close to what I am used to.
Update: 26 Jan 2009
Here's my .Xdefaults file for my xterm under X11 on OS X.
Contextual Query Language is defined
by the Library of Congress. I discovered it via CQL::Parser.
Brian Cassidy is involved, so it must be good.
I immediately thought "oh shit. Now my new Search::Query module feels late-to-the-party." But on further reading,
I think a CQL dialect in Search::Query makes some sense.
Search::Query is a SQL::Translator-like module for free-text search. I coded it up this week after brewing the idea for many months. I'm imagining it now as a next-generation Search::QueryParser::SQL, for contexts beyond SQL. Example: I have a query string that works with Xapian and want to convert it to one that works with Swish-e 2.x or KinoSearch. Just parse it with Search::Query::Parser, assign it a target dialect, and then call $query->stringify to get the translated version out.
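The parse-once, stringify-per-dialect idea can be sketched roughly as follows. This is a toy model in Python and not the Search::Query API: the `field:value` grammar, the `Clause` and `Query` classes, and both dialect functions are invented for illustration and do not reproduce the real syntax of Xapian, Swish-e, or KinoSearch.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    field: str
    value: str


@dataclass
class Query:
    clauses: list
    op: str = "AND"

    def stringify(self, dialect) -> str:
        # The dialect is just a function that renders the parsed tree.
        return dialect(self)


def parse(text: str) -> Query:
    """Parse 'field:value field:value' into a Query (toy grammar)."""
    clauses = []
    for token in text.split():
        field, sep, value = token.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"cannot parse clause: {token!r}")
        clauses.append(Clause(field, value))
    return Query(clauses)


def sql_dialect(query: Query) -> str:
    # Double single quotes so values stay inside the SQL string literal.
    parts = ["{} = '{}'".format(c.field, c.value.replace("'", "''"))
             for c in query.clauses]
    return f" {query.op} ".join(parts)


def equals_dialect(query: Query) -> str:
    # Made-up 'field=(value)' rendering, for contrast with the SQL one.
    parts = [f"{c.field}=({c.value})" for c in query.clauses]
    return f" {query.op} ".join(parts)


query = parse("color:red size:large")
print(query.stringify(sql_dialect))     # color = 'red' AND size = 'large'
print(query.stringify(equals_dialect))  # color=(red) AND size=(large)
```

Because the parser and the renderers only share the parsed tree, adding a dialect means writing one more rendering function.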
I know the people who read this blog generally do not care about Perl at all (hi Mom!)
but I spend a great deal of time writing code in the language and talking with other
members of the Perl community about our common projects, and so like anyone who has lived
in the Perl world for any length of time, I have an opinion about Perl6. For those not
in the know, Perl5 is the current version of Perl and has been around for over 10 years.
Perl6 is the next major version evolution, but it has been in development for nearly the same
length of time. The problem is that 10 years is a long time for a computer language release
to gestate and many folks whose opinions count (i.e. managers) see that lack of a release
as a sign that Perl Is Dead and not a good choice for their next programming project. So (the
argument goes) Perl6's vaporware status makes it hard for Perl5 programmers to find jobs, because
the "if it ain't new it ain't sexy" ethos of technology counts for more than it should with those
making the money decisions.
The real problem isn't that Perl6 hasn't been released. The real problem is the name Perl6. Perl6
is not a single executable "thing" like Perl5 is; it's an umbrella for several different projects. Right
now I can sit down at just about any modern Unix-like computer and type 'perl' and write some code
that runs. Perl6 doesn't work quite that way. It's a whole new language, not just a major revision to
an existing language. So the version number 5 vs 6 is misleading. That's the problem. Perl is alive and well.
Perl5 continues to be maintained and developed. I get lots of work done every day using it.
Reading through Matt Trout's blog
just now I found this wonderful quote:
Because in free software a question in the form of a well thought out patch is one that almost always gets a constructive answer.
Yes. That's just it. A patch -- real, applicable code -- indicates genuine forethought and effort and I will reward
that kind of conversation every time with equal effort.
The Xapian backend for Swish3 has been getting some love lately.
The swish_xapian command line tool now has most of the features that swish-e v2.x does.
I'd been keeping an email with a link to Ovid's journal article
about reviewing Perl Training websites. Now I've deleted the email. But the link was worth keeping here.
While I would like to figure out how to compile as a native 64bit app, my
MacBook has too many libs from before the 10.5 to 10.6 upgrade to trust that all the dep
chain is 64bit compat.
The error I was seeing was the noxious BAD_ADDRESS error which traced back to
some libxml2 hash features. Red herring. Of course, I had to recompile libxml2
with the -m32 as well so that everything was 32bit compatible. Took me hours
before I noticed that the older working version on the same box was about half
the size of the new version... which triggered the ol' 32-vs-64-bit thing in
my brain.
Update: In the end this was a bug in libswish3 with confusing naming of
some variables. But the 64-bit thing was a Good Thing To Realize.
I need to test web apps with IE7 for $work. I work from home and use a
reverse SSH tunnel into the corporate LAN. I run a SOCKS5 proxy using the
-D option to ssh over the reverse tunnel. I use a Mac.
What's a geek to do with these odds and ends?
I run VirtualBox (free VM from Sun) with WinXP for IE7. No problem.
I use Putty to open a ssh SOCKS5 proxy over the reverse ssh tunnel. No problem.
Problem: IE7 does not route DNS requests over SOCKS so even though I can theoretically
get to the remote HTTP server, I can't resolve names inside the corporate LAN using the corporate
DNS server.
A nice little Windows app that lets any Windows app proxy through it. Now I can test my web apps
with IE7 under a VM on a Mac using a reverse SSH tunnel + SOCKS5 proxy.
The big thing in this release is a rewrite in XS/C for much of the tokenizing and snippet extraction
code. That, and lots more test coverage. A big thanks to Henry at zen for prompting this development
and release and for providing good bug reports.
I also want to acknowledge how awesome the NYTProf
profiling tool is. Helped me find all the bottlenecks.
For several years I have developed software projects using Perl, pushing them to the
shared Perl repository at CPAN. During that
time I have maintained my own Trac install at perl.peknet.com,
mostly for the use of the SVN browser, which I find helpful. I've started updating the wiki
on that site as a home base for my Perl projects. Google suggests to me that I've not made
that URL public before, so here it is, for the collective memory.
Thanks to the presence of mind of Marcel Grünauer, the Perl community can easily see benchmarks for
common Perl accessor packages with
App::Benchmark::Accessors.
Glad to see Rose::Object (with Class::XSAccessor support) near the top of the list. That's what I chose
for Net::LDAP::Class, and I'll be switching to that for the rest of my projects RSN.
Creating more abstraction layers buys, as always, flexibility at the cost of complexity. Oftentimes we resort to inferior workarounds because they seem simpler, when in truth they are just dumbing down the problem. KISS is not a synonym for "half assed".
PHP requires that you change your HTML to indicate that an input
value in a form expects multiple values. That means your HTML
needs to know what your server-side architecture is coded in,
with that extra little [] bracket pair. That's just Wrong. And bad.
My HTML shouldn't care what the server side language is. HTTP is HTTP.
HTML is HTML. It's agnostic. Unless your scripting language is broken.
Like PHP is.
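For contrast, here is a minimal sketch of language-agnostic handling of a repeated form field, using `parse_qs` from Python's standard library. The field names and values are made up; the point is only that a plain repeated key, with no brackets in the HTML, can come back as a list.

```python
from urllib.parse import parse_qs

# What a browser sends for two inputs named "color" and one named "size".
body = "color=red&color=blue&size=large"

params = parse_qs(body)

# Every key maps to a list, so repeated keys need no special naming.
print(params["color"])  # ['red', 'blue']
print(params["size"])   # ['large']
```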
Given my new job, this quote from Anderson seems apt:
If so, leveraging the Free--paying people to get other people to write for
non-monetary rewards--may not be the enemy of professional journalists.
Instead, it may be their salvation.
Object-relational mappers are a nice way of simplifying data store interactions,
by abstracting the data model into an OO class structure. Or put another way,
don't write SQL, write code that is storage agnostic.
my $thing = Thing->new( id => 123 )->load;
$thing->foo('bar');
$thing->save;
#
# the above is mock code
# representing something like:
#
BEGIN TRANSACTION;
UPDATE things
SET foo = 'bar'
WHERE id = 123;
COMMIT;
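To make the mock code concrete, here is a runnable sketch of the same load/modify/save cycle, written in Python against an in-memory SQLite database. It is a toy and not how DBIx::Class or Rose::DB::Object work internally: the `Thing` class, the `things` table, and its columns are all invented for illustration.

```python
import sqlite3


class Thing:
    """Toy active-record class; table and column names are invented."""

    def __init__(self, db: sqlite3.Connection, id: int):
        self.db = db
        self.id = id
        self.foo = None

    def load(self) -> "Thing":
        row = self.db.execute(
            "SELECT foo FROM things WHERE id = ?", (self.id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no thing with id {self.id}")
        self.foo = row[0]
        return self

    def save(self) -> "Thing":
        # The connection as a context manager commits on success
        # and rolls back if the UPDATE raises.
        with self.db:
            self.db.execute(
                "UPDATE things SET foo = ? WHERE id = ?", (self.foo, self.id)
            )
        return self


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, foo TEXT)")
db.execute("INSERT INTO things (id, foo) VALUES (123, 'old')")
db.commit()

thing = Thing(db, 123).load()
thing.foo = "bar"
thing.save()

print(db.execute("SELECT foo FROM things WHERE id = 123").fetchone()[0])  # bar
```

The calling code never mentions SQL, which is the storage-agnostic part; only the class body would change for a different backend.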
I've used a couple of different Perl ORMs over the last four years with great joy:
DBIx::Class and (mostly) Rose::DB::Object. Now I'm looking for a suitable PHP
project for my toolbelt.
The most popular (or at least most-mentioned). It has its own special query language (DQL),
which is a philosophical turn-off. Isn't SQL+PHP good enough? But I see the DQL is optional.
Ambitious. The docs make it seem a little like the Rose framework in its goals:
an ORM, a Form manager, a web framework. There's a DB abstraction layer that claims
to support many different db flavors. It seems pretty young though.
I heard this story on MPR a few weeks
ago and was pretty fired up about the idea. If you have a good idea for how to use government data in a web app,
let me know and maybe we can build one.
Recent article in Tech News World about
which languages are most popular.
Now, I expect most popular writing on programming to gloss over the actual technical stuff and speak directly to managers, who often can't program their way out of a paper bag. But this quote is just pure nonsense:
"Java and its variants like Perl, Ajax, Python and Ruby, which effectively generate Java code, are unnecessarily low-level languages," Infostructure Associates' Kernochan said. "Adopting Java was, until recently, a step back in programmer productivity."
Perl, Ajax, Python and Ruby are variants of Java? That's just wrong, technically and chronologically. Perl was first released in 1987. Java was first released in 1995. Ajax isn't a language at all, it's a pattern. It's like saying "Poems are a language." Python and Ruby, while object-oriented like Java, are certainly not variants. And none of them generate Java code, effectively or not.
I drank the koolaid a few years ago on the usefulness of test-driven development.
I have Perl and the Perl community to thank for that. chromatic outlines
the history of Perl's test-infected culture in a recent post.
Roy resurrected the Swish-e project nearly 10 years ago.
There's a nice
interview with him out recently, in which he says "there really isn't anything I can't do with Perl and my favorite indexing tool, Swish-e" -- which is exactly my experience too.
No, it's not what I suffer from the lack of (as in sleep). It's
Representational State Transfer. It's been a buzzword for
a few years now. I'm just now reading about it, and thought
I would include some highlights here for my own reference.
From the URL above, REST exhibits the following characteristics:
Client-Server: a pull-based interaction style: consuming components pull
representations.
Stateless: each request from client to server must contain all the
information necessary to understand the request, and cannot take advantage
of any stored context on the server.
Cache: to improve network efficiency responses must be capable of being
labeled as cacheable or non-cacheable.
Uniform interface: all resources are accessed with a generic interface
(e.g., HTTP GET, POST, PUT, DELETE).
Named resources: the system is composed of resources which are named
using a URL.
Interconnected resource representations: the representations of the
resources are interconnected using URLs, thereby enabling a client to
progress from one state to another.
Layered components: intermediaries, such as proxy servers, cache servers,
gateways, etc., can be inserted between clients and resources to support
performance, security, etc.
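A few of these characteristics (named resources, a uniform interface, stateless requests) can be sketched with a tiny in-memory dispatcher. This is a toy in Python and has nothing to do with how CatalystX::CRUD is built: the `handle` function, the URL paths, and the choice of status codes are assumptions made for illustration.

```python
def handle(store: dict, method: str, path: str, body=None):
    """Apply one self-contained request to the resource store.

    Returns a (status, representation) pair. No per-client session is
    kept: everything needed is in the method, path, and body.
    """
    if method == "GET":
        return (200, store[path]) if path in store else (404, None)
    if method == "PUT":
        created = path not in store
        store[path] = body
        return (201 if created else 200, body)
    if method == "DELETE":
        if path in store:
            del store[path]
            return (204, None)
        return (404, None)
    return (405, None)  # method not supported by this sketch


resources = {}  # maps a URL path to its current representation
print(handle(resources, "PUT", "/things/123", {"foo": "bar"}))  # (201, {'foo': 'bar'})
print(handle(resources, "GET", "/things/123"))                  # (200, {'foo': 'bar'})
print(handle(resources, "DELETE", "/things/123"))               # (204, None)
print(handle(resources, "GET", "/things/123"))                  # (404, None)
```

Every resource is reached through the same small set of verbs, so a client needs no resource-specific method names.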
Trying to make my CatalystX::CRUD project more RESTful.
Long before I was a computer programmer I was an essay writer, a songwriter, a poet. When I discovered
Perl, I found the transition to programming very natural. I had always played with computers,
back to the IBM PC and Macintosh circa 1984. My first program was in BASIC, in 1985. It was a
'choose your own adventure'-type program. Even then, I wanted to combine prose with code. It
was just Making Stuff with Words. I didn't differentiate.
chromatic
suggests Perl programmers can improve their code by thinking in terms of sentences and paragraphs. Best practices.
Makes perfect sense to me. When my friends ask me about my work I tell them I'm a writer, that a good
piece of Perl code has the same structure and thought behind it as a well-written essay, and that I practice
the art of writing every day. It's just that the language I write in is Perl, not English. I know my metaphor is
lost on most non-programmers. But I trust some people understand.