Perl and Search: Where are We?

KinoSearch and Xapian compared

Peter Karman

http://www.peknet.com/~karpet/slides/fp/search

Anatomy of a Search Application

Every decent search application has these five basic components:

aggregator
normalizer
parser/analyzer
indexer
searcher

Aggregator

Gather a document collection. Document collections might originate from:

web
filesystem
database
outer space
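As a rough illustration of the filesystem case, the sketch below walks a directory tree and yields each file's path and raw bytes. The function name `aggregate_files` is hypothetical and is not taken from Xapian, Omega, or KinoSearch; a web or database aggregator would have the same shape but a different source.

```python
import os
from typing import Iterator, Tuple


def aggregate_files(root: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, raw_bytes) for every regular file under root.

    Hypothetical helper for illustration only.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        # Sort names so the order is stable within each directory.
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                yield path, fh.read()
```

Reading raw bytes rather than text leaves decoding and format conversion to the normalizer stage.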

Normalizer

Documents come in a variety of formats, many of them with MIME types that are not text/*. The normalizer converts them to text that the parser can handle.
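A minimal sketch of the idea, assuming the MIME type can be guessed from the file extension and that only HTML needs converting. The names `normalize` and `_TextExtractor` are hypothetical; a real normalizer would dispatch to external filters for PDF, Word, and similar formats.

```python
import mimetypes
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the character data of an HTML document, dropping the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def normalize(path: str, raw: bytes) -> str:
    """Return plain text for a document, based on its guessed MIME type."""
    mime, _encoding = mimetypes.guess_type(path)
    text = raw.decode("utf-8", errors="replace")
    if mime == "text/html":
        extractor = _TextExtractor()
        extractor.feed(text)
        extractor.close()
        # Collapse runs of whitespace left behind by the removed markup.
        # Note: script and style contents are not filtered in this sketch.
        return " ".join(" ".join(extractor.chunks).split())
    if mime is not None and mime.startswith("text/"):
        return text
    raise ValueError(f"no converter for {path} (guessed type: {mime})")
```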

Parser/Analyzer

Documents are tokenized into "words" with attention to position, context, length and linguistic quality (stemming, case, stopwords, etc.).
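The sketch below shows one way such an analyzer could look. The stopword list, the suffix-stripping rule, and the names `Token`, `naive_stem`, and `analyze` are all assumptions made for illustration; real libraries use proper stemmers such as Snowball.

```python
import re
from typing import List, NamedTuple

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


class Token(NamedTuple):
    term: str      # normalized form stored in the index
    position: int  # ordinal position of the word in the document
    start: int     # character offset of the original word
    length: int    # length of the original word


def naive_stem(word: str) -> str:
    """Strip a few common English suffixes. Not a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> List[Token]:
    """Tokenize text into lowercased, stemmed terms, skipping stopwords."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group().lower()
        if word in STOPWORDS:
            continue  # the position counter still advances past stopwords
        tokens.append(
            Token(naive_stem(word), position, match.start(), len(match.group()))
        )
    return tokens
```

Keeping the original position even when a stopword is dropped means phrase and proximity matching can still reason about word distance.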

Indexer

Store the analyzed tokens in a highly optimized storage system that aims to preserve the intelligence of the analysis.
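The classic structure behind this stage is an inverted index: a map from each term to the documents, and positions within them, where the term occurs. The in-memory sketch below assumes whitespace tokenization for brevity; the class name `InvertedIndex` is hypothetical, and real libraries use compressed on-disk formats rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, List


class InvertedIndex:
    """Map each term to {doc_id: [positions]} (illustrative only)."""

    def __init__(self) -> None:
        self.postings: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
        self.doc_lengths: Dict[int, int] = {}

    def add_document(self, doc_id: int, text: str) -> None:
        terms = text.lower().split()
        self.doc_lengths[doc_id] = len(terms)
        for position, term in enumerate(terms):
            # Positions are kept so phrase queries remain possible later.
            self.postings[term].setdefault(doc_id, []).append(position)

    def lookup(self, term: str) -> Dict[int, List[int]]:
        # .get() avoids creating empty entries for unknown terms.
        return self.postings.get(term.lower(), {})
```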

Searcher

Parse a user's query and retrieve matching documents from the index. Score and rank hits based on [your magic sauce here].
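As a stand-in for the "magic sauce", the sketch below ranks documents with a simple TF-IDF sum over the query terms (OR semantics). This is an assumption chosen for illustration and is not the scoring used by Xapian or KinoSearch. For brevity it scans the documents directly instead of reading a prebuilt index, and the name `search` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def search(docs: Dict[str, str], query: str) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs, best match first."""
    doc_terms = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    n_docs = len(docs)
    scores: Dict[str, float] = {}
    for term in query.lower().split():
        matching = [d for d, counts in doc_terms.items() if term in counts]
        if not matching:
            continue
        # Rare terms get a higher weight than common ones.
        idf = math.log(1 + n_docs / len(matching))
        for doc_id in matching:
            tf = doc_terms[doc_id][term]
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Sort by descending score, then by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```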

IR Libraries

analyzer + indexer + searcher
scoring/ranking algorithms
huge document collections (millions of documents), multiple machines
Lucene (Java), Xapian (C++), KinoSearch (C/Perl)

KinoSearch

fairly young (< 3 years)
initially inspired by Lucene via Plucene
initially mostly Perl, now mostly C and XS (70% sloccount)
small development community, lots of energy
lots of features, ambitious architecture
all UTF-8, all the time

Xapian

stable: in various forms/identities for 25 years (!)
1.0 release in 2007 with full UTF-8 support
core C++ library with SWIG, JNI and Perl bindings
Omega aggregator/normalizer package
example: http://search.gmane.org/

Comparison

KinoSearch:

one-man show
moving target
Perlish API

Xapian:

actively used in many languages
real-world testing
Perl bindings are direct C++ wrappers

Naive Benchmarks

$ time perl xindex.pl ~/projects/search_bench/

real    1m26.950s
user    1m12.400s
sys     0m10.117s

$ time perl xsearch.pl foo
Searching xapian_index
Running query 'Xapian::Query(foo)'
1 results found
ID 7 100% [ /Users/karpet/projects/search_bench/feldman-cia-worldfactbook-data.txt ]

real    0m0.064s
user    0m0.043s
sys     0m0.018s

$ time perl ksindex.pl ~/projects/search_bench/

real    0m54.725s
user    0m45.425s
sys     0m5.516s

$ time perl kssearch.pl foo
hits: 1
0.071 /Users/karpet/projects/search_bench/feldman-cia-worldfactbook-data.txt

real    0m0.206s
user    0m0.158s
sys     0m0.043s
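To compare the `real` wall-clock figures above, the small sketch below converts the `time` output to seconds and prints the two ratios. The helper name `to_seconds` is hypothetical; the four values are copied from the runs above.

```python
import re


def to_seconds(value: str) -> float:
    """Convert a `time` value such as '1m26.950s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` times copied from the four runs above.
real = {
    "xapian_index": to_seconds("1m26.950s"),
    "xapian_search": to_seconds("0m0.064s"),
    "kinosearch_index": to_seconds("0m54.725s"),
    "kinosearch_search": to_seconds("0m0.206s"),
}

index_ratio = real["xapian_index"] / real["kinosearch_index"]
search_ratio = real["kinosearch_search"] / real["xapian_search"]
print(f"indexing:  Xapian took {index_ratio:.2f}x as long as KinoSearch")
print(f"searching: KinoSearch took {search_ratio:.2f}x as long as Xapian")
```

On these single runs, Xapian indexing took roughly 1.6 times as long as KinoSearch, and the one-off KinoSearch search took roughly 3.2 times as long as Xapian. With one iteration each, these are naive numbers and should be read accordingly.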

Addendum #1

After this presentation was given on 16 Feb 2008, Marvin Humphrey, the KinoSearch author, wrote to note the following:

FWIW, since your sample search app only does one iteration and it doesn't reuse the Searcher, 
it's not taking full advantage of KinoSearch's capabilities.  
KS is supposed to be "fast enough" for a scenario just like that one, 
and it seems to have performed acceptably, but searching is a *lot* faster when you cache the Searcher.

Check out the following stats courtesy of Benchmark::Stopwatch.

Regular CGI, at http://www.rectangular.com/cgi-bin/uscon_bench.cgi?q=congress&offset=0:

NAME                        TIME        CUMULATIVE      PERCENTAGE
 load modules                0.121       0.121           73.754%
 init searcher               0.004       0.125           2.626%
 process search              0.032       0.158           19.735%
 fetch hits                  0.006       0.164           3.877%
 _stop_                      0.000       0.164           0.008%

CGI::Fast, at http://www.rectangular.com/fcgi/uscon_search.cgi?q=congress&offset=0:

NAME                        TIME        CUMULATIVE      PERCENTAGE
 process search              0.002       0.002           24.213%
 fetch hits                  0.006       0.008           75.602%
 _stop_                      0.000       0.008           0.186%
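The difference between the two tables comes from paying the module-loading and searcher-initialization cost once per process instead of once per request. The sketch below illustrates that pattern in a language-neutral way; it is not KinoSearch or CGI::Fast code, and the names `Searcher`, `get_searcher`, and `handle_request` are hypothetical.

```python
from typing import Dict, List, Optional


class Searcher:
    """Stand-in for an expensive-to-construct searcher object."""

    instances_created = 0

    def __init__(self, index: Dict[str, List[str]]) -> None:
        # In a real application this is where index files would be opened.
        Searcher.instances_created += 1
        self.index = index

    def search(self, term: str) -> List[str]:
        return self.index.get(term.lower(), [])


_cached_searcher: Optional[Searcher] = None


def get_searcher(index: Dict[str, List[str]]) -> Searcher:
    """Build the searcher on first use, then reuse it for later requests."""
    global _cached_searcher
    if _cached_searcher is None:
        _cached_searcher = Searcher(index)
    return _cached_searcher


def handle_request(index: Dict[str, List[str]], term: str) -> List[str]:
    # A persistent process calls this once per incoming query.
    return get_searcher(index).search(term)
```

In a plain CGI script the process exits after each request, so nothing survives to be reused; a persistent process keeps the cached object alive between requests.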

Addendum #2: Swish-e 2.4 benchmark

The current Swish-e release, 2.4.5, against the same document corpus:

$ time swish-e -i ~/projects/search_bench
real    0m31.833s
user    0m16.493s
sys     0m7.499s

$ time swish-e -w foo
real    0m0.015s
user    0m0.007s
sys     0m0.008s