Perl and Search: Where are We?
KinoSearch and Xapian compared
Peter Karman
Anatomy of a Search Application
Every decent search application has these five basic
components:
- aggregator
- normalizer
- parser/analyzer
- indexer
- searcher
Aggregator
Gather a document collection. Document collections might
originate from:
- web
- filesystem
- database
- outer space
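For the filesystem case, gathering can be as simple as a recursive directory walk. A minimal Python sketch follows; the function name `gather_files` and the extension filter are illustrative assumptions, not part of any library discussed here.

```python
import os
from typing import Iterator, Tuple


def gather_files(root: str, extensions: Tuple[str, ...] = (".txt", ".html")) -> Iterator[str]:
    """Yield paths under `root` whose (case-insensitive) extension is in `extensions`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # str.endswith accepts a tuple of candidate suffixes
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)
```

A web or database aggregator would have the same shape: something that yields document identifiers for the next stage to fetch.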
Normalizer
Documents come in a variety of formats, many of them with
MIME types that are not text/*. The normalizer converts them
to text (or a common markup) so the parser can read them.
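Turning a document into plain text based on its MIME type can be sketched as below. This is a rough Python illustration using only the standard library; `normalize` and `_TextExtractor` are hypothetical names, the MIME type is guessed from the filename alone, and real normalizers dispatch to external converters for formats such as PDF.

```python
import mimetypes
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect character data between tags (script/style bodies are not filtered out)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalize(filename: str, raw: bytes) -> str:
    """Return plain text for `raw`, choosing a converter by guessed MIME type."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime == "text/html":
        parser = _TextExtractor()
        parser.feed(raw.decode("utf-8", errors="replace"))
        parser.close()
        # collapse runs of whitespace left behind by removed markup
        return " ".join(" ".join(parser.chunks).split())
    if mime is not None and mime.startswith("text/"):
        return raw.decode("utf-8", errors="replace")
    # assumption: anything else would need an external converter
    raise ValueError(f"no converter registered for {mime!r}")
```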
Parser/Analyzer
Documents are tokenized into "words" with attention to
position, context, length and linguistic quality (stemming,
case, stopwords, etc.).
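A toy analyzer makes these steps concrete. The Python sketch below lowercases, splits on word characters, drops stopwords, and applies a crude suffix stripper; the stopword list and `naive_stem` are stand-ins chosen for illustration, not what KinoSearch or Xapian actually do.

```python
import re
from typing import List, Tuple

STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list


def naive_stem(word: str) -> str:
    """Strip a few English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> List[Tuple[str, int]]:
    """Return (term, position) pairs; positions count every token, including dropped stopwords."""
    terms = []
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        token = match.group()
        if token in STOPWORDS:
            continue
        terms.append((naive_stem(token), position))
    return terms
```

Keeping the original position even when a stopword is dropped is what lets phrase and proximity queries work later.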
Indexer
A highly optimized storage system that aims to preserve the
intelligence of the analysis.
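The usual data structure behind such storage is an inverted index: a map from each term to the documents and positions where it occurs. The Python sketch below keeps it in memory as nested dictionaries, an assumption made for brevity; real indexers use compact on-disk formats.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, List[int]]]:
    """Map term -> {doc_id: [positions]} from already-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)
```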
Searcher
Parse a user's query and retrieve matching documents from the
index. Score and rank hits based on [your magic sauce here].
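As one possible "magic sauce", the Python sketch below ranks hits by a simple TF-IDF sum. Everything here is an assumption for illustration: the whitespace query parser, the implicit AND between terms, and the particular IDF formula. It is not the scoring used by either library.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def search(query: str, docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Rank documents containing every query term by summed TF-IDF."""
    terms = query.lower().split()
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    hits = []
    for doc_id, tf in counts.items():
        if not terms or any(tf[t] == 0 for t in terms):
            continue  # implicit AND: skip documents missing any term
        score = 0.0
        for t in terms:
            df = sum(1 for c in counts.values() if c[t] > 0)  # document frequency
            score += tf[t] * math.log(1 + len(docs) / df)
        hits.append((doc_id, score))
    # best score first; ties broken by document id for stable output
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

A real searcher would read term statistics from the index rather than recounting tokens on every query.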
IR Libraries
- analyzer + indexer + searcher
- scoring/ranking algorithms
- huge document collections (millions of documents),
multiple machines
- Lucene (Java), Xapian (C++), KinoSearch (C/Perl)
KinoSearch
- fairly young (< 3 years)
- initially inspired by Lucene via Plucene
- initially mostly Perl, now mostly C and XS (70% by
sloccount)
- small development community, lots of energy
- lots of features, ambitious architecture
- all UTF-8, all the time
Xapian
- stable: in various forms/identities for 25 years (!)
- 1.0 release in 2007 with full UTF-8 support
- core C++ library with SWIG, JNI and Perl bindings
- Omega aggregator/normalizer package
- example: http://search.gmane.org/
Comparison
KinoSearch:
- one man show
- moving target
- Perlish API
Xapian:
- actively used in many languages
- real-world testing
- Perl bindings are direct C++ wrappers
Naive Benchmarks
$ time perl xindex.pl ~/projects/search_bench/
real 1m26.950s
user 1m12.400s
sys 0m10.117s
$ time perl xsearch.pl foo
Searching xapian_index
Running query 'Xapian::Query(foo)'
1 results found
ID 7 100% [ /Users/karpet/projects/search_bench/feldman-cia-worldfactbook-data.txt ]
real 0m0.064s
user 0m0.043s
sys 0m0.018s
$ time perl ksindex.pl ~/projects/search_bench/
real 0m54.725s
user 0m45.425s
sys 0m5.516s
$ time perl kssearch.pl foo
hits: 1
0.071 /Users/karpet/projects/search_bench/feldman-cia-worldfactbook-data.txt
real 0m0.206s
user 0m0.158s
sys 0m0.043s
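The wall-clock figures above can be put side by side with a little arithmetic. The Python sketch below does so; the dictionary layout and function names are illustrative, and the only data are the `real` times reported above.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a shell `time` value such as '1m26.950s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" times copied from the runs above
REAL_TIMES = {
    "xapian": {"index": "1m26.950s", "search": "0m0.064s"},
    "kinosearch": {"index": "0m54.725s", "search": "0m0.206s"},
}


def ratio(task: str, numerator: str, denominator: str) -> float:
    """How many times longer `numerator` took than `denominator` for `task`."""
    return to_seconds(REAL_TIMES[numerator][task]) / to_seconds(REAL_TIMES[denominator][task])
```

On these numbers, indexing took Xapian roughly 1.6 times as long as KinoSearch, while the one-shot search took KinoSearch roughly 3.2 times as long as Xapian. Each figure comes from a single run, so treat the ratios as rough.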
Addendum #1
Marvin Humphrey, the KinoSearch author, wrote the following after this presentation
was given on 16 Feb 2008:
FWIW, since your sample search app only does one iteration and it doesn't reuse the Searcher,
it's not taking full advantage of KinoSearch's capabilities.
KS is supposed to be "fast enough" for a scenario just like that one,
and it seems to have performed acceptably, but searching is a *lot* faster when you cache the Searcher.
Check out the following stats courtesy of Benchmark::Stopwatch.
Regular CGI, at http://www.rectangular.com/cgi-bin/uscon_bench.cgi?q=congress&offset=0:
NAME            TIME   CUMULATIVE  PERCENTAGE
load modules    0.121  0.121       73.754%
init searcher   0.004  0.125        2.626%
process search  0.032  0.158       19.735%
fetch hits      0.006  0.164        3.877%
_stop_          0.000  0.164        0.008%
CGI::Fast, at http://www.rectangular.com/fcgi/uscon_search.cgi?q=congress&offset=0:
NAME            TIME   CUMULATIVE  PERCENTAGE
process search  0.002  0.002       24.213%
fetch hits      0.006  0.008       75.602%
_stop_          0.000  0.008        0.186%
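The difference between the two tables comes from keeping the process, and with it the loaded modules and the Searcher, alive between requests. The general pattern can be sketched in Python as below. The `Searcher` class here is a hypothetical stand-in, not the KinoSearch API, and `uscon_index` is an invented path.

```python
from functools import lru_cache


class Searcher:
    """Hypothetical stand-in for an object that is expensive to construct."""

    instances = 0  # counts constructions so the reuse is observable

    def __init__(self, index_path: str):
        Searcher.instances += 1
        self.index_path = index_path

    def search(self, query: str) -> str:
        return f"results for {query!r} from {self.index_path}"


@lru_cache(maxsize=None)
def get_searcher(index_path: str) -> Searcher:
    """Build the Searcher once per index path and reuse it afterwards."""
    return Searcher(index_path)


def handle_request(query: str, index_path: str = "uscon_index") -> str:
    """Per-request work in a persistent process: only the search itself."""
    return get_searcher(index_path).search(query)
```

In a plain CGI setup every request starts a new process, so the cache is always empty and the construction cost is paid each time.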
Addendum #2: Swish-e 2.4 benchmark
The current Swish-e release (2.4.5) against the same document corpus:
$ time swish-e -i ~/projects/search_bench
real 0m31.833s
user 0m16.493s
sys 0m7.499s
$ time swish-e -w foo
real 0m0.015s
user 0m0.007s
sys 0m0.008s