peknet :: an eddy in the bit stream         
about peknet
peknet is Peter E Karman musing on technology, politics, religion, books, beer and parenthood.

© 2005 peknet dot com

Xapian, KinoSearch

I'm starting in (again) on KS and Xapian backends for SWISH::Prog, using libswish3.

I just downloaded the latest Xapian release (1.0.10) and latest KS from svn. I started the build process for Xapian first, then svn up'd KS and started the KS build. The KS build finished, and all tests passed, before Xapian even finished its build. That could be due to the speed of the gcc vs g++ compilers. Dunno. It was just a startling thing to notice.

Now back to writing code...

File under projects/swish Tue Dec 30 20:15:45 CT 2008

Dog House

Truth in Advertising

File under general/ Wed Dec 10 08:20:51 CT 2008

I love kitties

File under general/ Sun Nov 9 08:44:52 CT 2008

Dobson

A friend tipped me off to this sociologist's take on Obama and the new evangelical left. Interesting read; make sure you look at the comments too.

File under general/ Wed Oct 8 20:40:26 CT 2008

Perl Myths

I had the good fortune to share a train ride with Tim Bunce while leaving OSCON in 2006. We had a nice chat. Just found his blog and talk from this year's OSCON -- the slides are worth a read.

File under projects/ Sun Oct 5 14:22:09 CT 2008

Veep Debate

Watching the US VP debate right now.

Sure, Palin is cute. And she has that "sure, heckava lotta" folksy smile. She keeps appealing to the good ol' common sense of America, and when she smiles, I can feel red America getting redder.

Biden is kicking her butt. But I'm doubtful that the swing voters are hearing that.

File under general/ Thu Oct 2 20:33:27 CT 2008

Sun VirtualBox

Sun's virtual machine is free and open source. And cool. I tried it out tonight with W2K and am impressed.

File under projects/ Tue Sep 23 22:05:57 CT 2008

Swish3 Status 19 Sept 2008

A long hiatus for a full summer and then some contract work.

Some benchmarks of the latest tokenization algorithm show that pure-ASCII tokenization is about 20% faster and UTF-8 tokenization is about 2% slower. So I'll take it.

The benchmark used perl/docmaker.pl to generate 100 random "docs" in each encoding (ASCII and random UTF-8), then timed swish_lint on them with and without the -t option.
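
A rough sketch of the same kind of comparison, written in Python rather than run through swish_lint. The regex tokenizer, the document generator and all names here are stand-ins invented for illustration; none of this is libswish3 code, and the timings it prints say nothing about libswish3 itself.

```python
import random
import re
import time

WORD_RE = re.compile(r"\w+")  # stand-in tokenizer, not the libswish3 algorithm


def make_doc(alphabet, n_words=200, rng=None):
    """Build one random 'doc' of space-separated words drawn from alphabet."""
    rng = rng or random.Random(0)
    words = ("".join(rng.choice(alphabet) for _ in range(rng.randint(2, 10)))
             for _ in range(n_words))
    return " ".join(words)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return WORD_RE.findall(text.lower())


def time_tokenizer(docs, repeat=5):
    """Return the best wall-clock time (seconds) for tokenizing all docs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for doc in docs:
            tokenize(doc)
        best = min(best, time.perf_counter() - start)
    return best


rng = random.Random(42)
ascii_docs = [make_doc("abcdefghijklmnopqrstuvwxyz", rng=rng) for _ in range(100)]
utf8_docs = [make_doc("abcdeéèüñøßλжя", rng=rng) for _ in range(100)]

ascii_time = time_tokenizer(ascii_docs)
utf8_time = time_tokenizer(utf8_docs)
print(f"ASCII: {ascii_time:.4f}s  non-ASCII: {utf8_time:.4f}s")
```

Taking the best of several runs, rather than a single run, reduces noise from whatever else the machine is doing.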

Also recently fixed some failing tests on Linux and a memory warning.

File under projects/swish Fri Sep 19 05:33:31 CT 2008

Tests

Swish3 has a lot more tests than Swish-e (not counting Josh Rabinowitz's excellent testing package). And as of tonight all Swish3 tests are passing under both Linux 2.6 (CentOS 5) and OS X 10.4. \o/

File under projects/swish Thu Sep 18 22:54:29 CT 2008

Cognition

I'm not shilling for a corporation. Just found an announcement about a new semantic map that sounds very impressive, and further clicking found this blog.

File under projects/ Thu Sep 18 08:41:53 CT 2008

NASA Web Sites

Talk about an effort in information architecture.

File under general/ Thu Sep 18 08:27:44 CT 2008

CRUD

A year ago I announced the idea of a new Catalyst CRUD framework based on my previous work with Catalyst::Controller::Rose. The basic design hasn't changed much, and a year later, CatalystX::CRUD (CXCRUD) 0.30 hit CPAN today, along with a bunch of related modules.

Here's the rundown:
CatalystX::CRUD
The core API and base classes. There's an RPC-style Controller, a REST-style Controller, a Model, a ModelAdapter, and some support for those, especially for testing. There's been some misunderstanding about what CXCRUD is. It's not a scaffolding creator. It's not a code generator. It's not even really a framework. It's an API and some base classes that help implement the API. That's about it. CXCRUD makes it easier to get your form code and your model code into HTTP-land with Catalyst.


CatalystX::CRUD::Controller::RHTMLO
This base controller assumes you are using Rose::HTML::Form as your form class. The .pm file is only 200 lines long, most of which is documentation. The point is that CatalystX::CRUD::Controller (the base class) implements everything, and the RHTMLO version just adapts it a little to Rose::HTML::Form.


CatalystX::CRUD::Model::RDBO
Brings your RDBO and RDBO::Manager classes into the Catalyst::Model namespace. Mostly this class just implements the core API methods for searching and fetching objects.


CatalystX::CRUD::ModelAdapter::DBIC
Makes it easy to use your existing Catalyst::Model::DBIC::Schema code with CXCRUD. I want to thank Zbigniew Lukasiak for pushing me toward this particular implementation and away from the earlier (and now deprecated) CatalystX::CRUD::Model::DBIC. I've been using this particular module lately on a project so it has gotten some needed attention.


CatalystX::CRUD::View::Excel
The core API is silent with regard to the View. That's because the View is properly outside the concern of the create, read, update and delete actions. But that doesn't mean the View is irrelevant. The Excel View lets you return CRUD results as an Excel file since that seems to be a useful format for manipulating data on the desktop, outside the HTTP sphere -- and because it allows a far richer array of presentation options than anything the core API could define.


CatalystX::CRUD::YUI
Six months ago I announced Rose::DBx::Garden::Catalyst, which continued to evolve as part of a project I was working on. It got to the point where I had put so much effort into the Template Toolkit design and Javascript (built on top of the Yahoo! User Interface library) that I felt it deserved to be split out into its own package and made independent of Rose::DB::Object.

CatalystX::CRUD::YUI 0.004 hit CPAN today. It conforms entirely to the core 0.30 CXCRUD API, which means that it is model-agnostic. You can use it with RDBO, DBIC and (hopefully soon) LDAP. Or whatever your model is. It just needs to implement the CatalystX::CRUD::Model or ModelAdapter API.

CatalystX::CRUD::YUI offers an easy way to do web-based administration of a database. It supports one2many and many2many relationships.


Rose::HTMLx::Form::Related
This package extends Rose::HTML::Form to perform model introspection. It currently has drivers for both RDBO and DBIC. It also comes out of the Rose::DBx::Garden::Catalyst project. Each model driver requires a helper package, either Rose::DBx::Object::MoreHelpers for RDBO or DBIx::Class::RDBOHelpers for DBIC. These helper packages implement the same methods and (in the case of RDBOHelpers in particular) aid in relationship introspection.


Rose::HTMLx::Form::Field::Serial
A new field type for RHTMLO for auto-increment fields.


Search::QueryParser::SQL
This is my latest project. It turns free-text queries into something that CXCRUD can use. There are methods for DBI, RDBO and DBIC, as well as raw SQL. Search::QueryParser is a great package, and it was fun to build on top of.
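
To make the idea concrete, here is a minimal sketch in Python of turning a free-text query into a parameterized WHERE clause. The function name `query_to_sql`, the supported syntax and the LIKE-based matching are all assumptions made for illustration; this is not the Search::QueryParser::SQL API or its query grammar.

```python
import shlex


def query_to_sql(query, default_column):
    """Turn a free-text query into (where_clause, bind_values).

    Supports bare terms, "quoted phrases", column:term, and a leading '-'
    for negation. Terms are ANDed together. Hypothetical helper for
    illustration; not the Search::QueryParser::SQL API.
    """
    clauses, values = [], []
    for token in shlex.split(query):
        negate = token.startswith("-")
        if negate:
            token = token[1:]
        column, sep, term = token.partition(":")
        if not sep:
            # No explicit column: search the default one.
            column, term = default_column, token
        if not column.isidentifier():
            # Column names cannot be bound as parameters, so vet them.
            raise ValueError(f"bad column name: {column!r}")
        op = "NOT LIKE" if negate else "LIKE"
        clauses.append(f"{column} {op} ?")
        values.append(f"%{term}%")
    return " AND ".join(clauses), values


where, binds = query_to_sql('perl "search engine" -author:smith', "title")
print(where)
print(binds)
```

Returning the clause and the bind values separately keeps user input out of the SQL string, so the result can be handed to any DBI-style `execute` call.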


Rose::DBx::Garden::Catalyst
This package is now mostly a thin wrapper around CatalystX::CRUD::YUI and Rose::HTMLx::Form::Related, with the added features of code scaffolding generation. It's sort of like a Catalyst Helper on steroids. Version 0.09 has been a long time (well, months) in development, and now includes real tests thanks to the contribution of a schema from Laust Frederiksen.


File under projects/ Thu Sep 11 20:52:17 CT 2008

I guess a Community Organizer is like a Mayor ... except he has a soul.

File under general/ Sun Sep 7 20:34:49 CT 2008

Form philosophy

A form management package has one goal: to help preserve the integrity of data as it moves from server to client and back again.

Most form packages do two things: validate data and serialize data as (X)HTML. Some offer additional client-side validation checks via Javascript, etc. Others offer tight integration with particular data models.

Rose::HTML::Objects does both things well. RHTMLO allows you to define form classes that represent reasonably complicated data models, providing validation and serialization.

Some developers are of the opinion that serialization is not properly the function of a form manager because it blurs the line between view and model. I disagree. Proper and correct serialization is important to the validation process, and hence vital to the model. It is but one step in a series of validation layers.

Validation happens when a human being completes a (X)HTML form, as she self-monitors her attempt to enter good and correct data ("did I spell that correctly?"). Validation may happen again before the form is submitted, using client-side Javascript. Validation happens again when the server receives the request. Again, when the form object is initialized with the submitted data. Again, when the data is committed to storage. At every step checks are made to preserve what the client submitted and verify that the data conforms with what is expected and required.

Since serialization to and from (X)HTML is part of the roundtrip all data takes, a good form management tool should be able to handle (X)HTML creation as well as server-side validation. That's not blurring the model/view line; it's reflecting the reality that data must be handled by human beings and the web browser is one of the best tools we currently have.
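
As an illustration of the argument, here is a minimal Python sketch of a form object that handles both server-side validation and (X)HTML serialization. The class and method names are hypothetical and only loosely inspired by the idea of a form class; this is not the Rose::HTML::Objects API.

```python
import html
from dataclasses import dataclass, field


@dataclass
class TextField:
    name: str
    required: bool = False
    maxlength: int = 64
    value: str = ""
    error: str = ""

    def validate(self):
        """Check the current value; record and report any error."""
        self.error = ""
        if self.required and not self.value.strip():
            self.error = "required"
        elif len(self.value) > self.maxlength:
            self.error = f"longer than {self.maxlength} characters"
        return not self.error

    def to_html(self):
        """Serialize the field, escaping the value so it survives the round trip."""
        return (f'<input type="text" name="{html.escape(self.name)}" '
                f'value="{html.escape(self.value)}" '
                f'maxlength="{self.maxlength}">')


@dataclass
class Form:
    fields: list = field(default_factory=list)

    def init_with_params(self, params):
        """Load submitted values (a dict of name -> string) into the fields."""
        for f in self.fields:
            f.value = params.get(f.name, "")

    def validate(self):
        # Validate every field (no short-circuit) so all errors are recorded.
        results = [f.validate() for f in self.fields]
        return all(results)

    def to_html(self):
        return "\n".join(f.to_html() for f in self.fields)


form = Form([TextField("name", required=True), TextField("city", maxlength=5)])
form.init_with_params({"name": 'Tom & "Jerry"', "city": "St Paul"})
is_valid = form.validate()
print(is_valid, [f.error for f in form.fields])
print(form.to_html())
```

Because the same object knows the `maxlength` constraint and writes the markup, the limit the browser enforces and the limit the server checks cannot drift apart.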

File under projects/ Thu Aug 14 21:38:36 CT 2008

Wide Finder

An intriguing experiment in testing parallel architecture.

File under projects/ Mon Jul 28 22:16:43 CT 2008

Years pass, code grows, expectations ... ?

It's the OSCON time of year again. A couple of years ago I was in a hotel in Portland OR, coding up a KinoSearch implementation of Swish3. That was then. Two years later, Swish3 is really no further on, except that it has been largely refactored and is a much more stable, yet alpha, project.

Just remembering that tonight as I update the Perl bindings of Swish3 to use the new tokenization routines. It's the kind of project that moves in fits and starts. I would have liked to have finished it years ago, but it would not be as good, since many of the refactorings I have made over the last couple of years have been the direct result of coding strategies I have learned in my $job(s) and other FOSS projects.

And I find I like my life this way, chipping away at building a better mousetrap while I sip New Belgium beer and listen to Robert Plant and Alison Krauss. Sure, it's a sultry July evening, and my entire life occasionally floats through my mind like a grainy day at the beach, but hey: you only get this chance once, and building a better mousetrap is not a bad way to pass some idle moments.

File under projects/swish Thu Jul 24 21:46:09 CT 2008

Program like it's 1975

I benefited from this article.

File under projects/ Tue May 6 21:35:18 CT 2008

Jonah Plays the Blues

File under general/ Tue May 6 21:34:19 CT 2008

Alfred Hitchcock?

We've made the joke many times, but the resemblance in this montage is uncanny.

File under general/ Mon Apr 28 15:43:57 CT 2008

Swish3 Status 19 April 2008

There's been quite a bit of activity in the last month.
  • The C++ Xapian example now can search as well as index, and there are Perl equivalents using Search::Xapian checked into svn as well. The C++ code will read/write the swish.xml header; the Perl does not (yet).
  • The meta/prop id unique check now uses a hash for quick look up.
  • The test suite for libswish3 is totally restructured. Now using Perl's Test::Harness and added a slew of new meta/prop tests. Alongside that were additions to the NamedBuffer debugging output to print each substring in the buffer.
  • Several new string-related utility functions for converting ints to strings and back. Also a new config hash for configuration options that use a StringList instead of a simple string.
  • Fixed some mem leaks in the example .c programs and added more info to the swish_lint usage() output (including reminders about the various SWISH_DEBUG* env var values).
There are still several parser features yet to be implemented to support the Swish-e 2.4 config options, but those will likely take a backseat to getting a working swish3 Perl script running with SWISH::Prog and SWISH::Prog::Xapian.
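
The hash-based unique check mentioned in the list above can be sketched as follows, in Python rather than C. The function name and data layout are assumptions for illustration only and do not reflect the actual libswish3 structs.

```python
def register_ids(names_and_ids):
    """Register (name, id) pairs, rejecting any id that is already taken.

    A dict keyed by id makes each uniqueness check an average O(1) lookup
    instead of a scan over everything registered so far.
    """
    seen = {}  # id -> name that claimed it
    for name, ident in names_and_ids:
        if ident in seen:
            raise ValueError(
                f"id {ident} for {name!r} already used by {seen[ident]!r}")
        seen[ident] = name
    return seen


registry = register_ids([("title", 1), ("author", 2), ("subject", 3)])
print(registry)
```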

File under projects/swish Sat Apr 19 22:42:14 CT 2008

Ruby on Rails

I do not use it myself, preferring Perl+Catalyst. But this is an interesting perspective on deployment issues at a major shared hosting provider and as such, is not limited to RoR.

File under projects/ Fri Apr 11 09:06:46 CT 2008

Swish3 Status 30 March 2008

More progress with Swish3.
  • There is now a swish_xapian.cpp C++ example for using libswish3 with a Xapian backend. Only the indexing portion is complete; the search part is still TODO. Still, it is significant that it was so easy to build a search engine.
  • The swish.xml header format is complete, and the code can now read/write the header file. Need to add that part to the swish_xapian.cpp example.
  • Squashed some long-standing memory leaks when using the filehandle functions.
Little by little.

File under projects/swish Sun Mar 30 22:03:48 CT 2008

Protest

This photo got a lot of laughter at our house.

File under general/ Sat Mar 22 21:21:53 CT 2008

Swish3 Status 2008-03-15

I've finally gotten back to Swish3 development after several months away. Hard to believe I've been working on this project for something close to 3 years now.

Lately I have been focusing on the following things:
Header file format
Because Swish3 will have multiple IR backends, it is important that there be a consistent index metadata file that describes the MetaNames, Properties, and tokenizing information, just like the Swish2 header does. Just as with the config file format, it makes sense to define the header file format as XML, since we already have a robust XML parser for free. To make it simple, I have defined the header file XML schema to be the same as the config file schema. In short, you configure Swish3 by creating a header file. The "real" header file will be more strict about explicitly naming all the expected attribute values, numbering the MetaName/Property ids, etc. But the idea is simple: a single XML schema.

I have written the code to read header/config file format and create a swish_Config object. There's also code for merging 2 swish_Config objects together, so that you can define a config file to override an existing header file.
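
A minimal sketch of the override idea, in Python rather than C: two XML documents that share one schema are read into the same structure and merged, with the config's values winning. The element and attribute names are invented for illustration and are not the real Swish3 header/config schema, and `merge` is not the swish_Config API.

```python
import xml.etree.ElementTree as ET

# Element and attribute names below are invented for illustration;
# they are not the real Swish3 header/config schema.
HEADER_XML = """
<swish>
  <MetaNames>
    <title id="0" bias="1"/>
    <author id="1" bias="0"/>
  </MetaNames>
</swish>
"""

CONFIG_XML = """
<swish>
  <MetaNames>
    <title bias="10"/>
    <subject/>
  </MetaNames>
</swish>
"""


def read_metanames(xml_text):
    """Parse one document into {metaname: {attribute: value}}."""
    root = ET.fromstring(xml_text)
    return {el.tag: dict(el.attrib) for el in root.find("MetaNames")}


def merge(base, override):
    """Return a new mapping where override's attributes win over base's."""
    merged = {name: dict(attrs) for name, attrs in base.items()}
    for name, attrs in override.items():
        merged.setdefault(name, {}).update(attrs)
    return merged


config = merge(read_metanames(HEADER_XML), read_metanames(CONFIG_XML))
print(config)
```

Because both inputs use one schema, a single reader serves for both files, and the merge is just a per-name attribute update.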

Still TODO is the code for writing the header file.

The Perl bindings have been updated to reflect the new swish_Config API. This required a great deal of reworking and thinking about the Perl API. I had to rewrite things a few times to get a workable solution. The key Perl mantra is "objects on demand." I.e., do not define any Perl objects that wrap C pointers and try to keep them on the XS side. Instead, create all Perl objects "just in time" as part of the get_* method call. This makes reference counting much simpler.


MetaNames and PropertyNames
These now have their own C API with swish_MetaName and swish_Property structs. These relate directly to the header file format and swish_Config. There will end up being a separate PropertyName API for search results. I still think we're going to have to port the Swish2 PropertyNames storage/retrieval code to Swish3 and have a backend-independent index.prop file. The issue with this is going to be scaling. One other thought I've had is storing properties in a SQLite db. That route won't allow for presorted properties, but does have the advantage of being much more transparent and de-buggable.
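
A sketch of what the SQLite option could look like, using Python's standard sqlite3 module. The table layout, column names and helper function are assumptions for illustration, not a Swish3 design. Sorting happens at query time via ORDER BY rather than from a presorted file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE props (
        doc_id INTEGER NOT NULL,
        name   TEXT    NOT NULL,
        value  TEXT,
        PRIMARY KEY (doc_id, name)
    )
""")
# An index on (name, value) can help SQLite order one property's values,
# but this is still a query-time sort, not a presorted property file.
conn.execute("CREATE INDEX props_by_value ON props (name, value)")

rows = [
    (1, "title", "Zebra facts"), (1, "author", "Karman"),
    (2, "title", "Apple pie"),   (2, "author", "Smith"),
    (3, "title", "Mango lassi"), (3, "author", "Jones"),
]
conn.executemany("INSERT INTO props VALUES (?, ?, ?)", rows)


def sorted_doc_ids(conn, doc_ids, prop):
    """Return doc_ids ordered by the value of one property."""
    marks = ",".join("?" * len(doc_ids))
    sql = (f"SELECT doc_id FROM props WHERE name = ? "
           f"AND doc_id IN ({marks}) ORDER BY value")
    return [r[0] for r in conn.execute(sql, [prop, *doc_ids])]


by_title = sorted_doc_ids(conn, [1, 2, 3], "title")
by_author = sorted_doc_ids(conn, [1, 2, 3], "author")
print(by_title, by_author)
```

The transparency advantage is visible here: the stored properties can be inspected with any SQLite client and a plain SELECT.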


SWISH::Prog
I have moved the SWISH::Prog svn tree to svn.swish-e.org from peknet.com. I also moved SWISH::Filter (and likely will eventually move SWISH::API::More and its cousins).

SWISH::Prog will form the framework for the Perl implementation of Swish3. I know there are some folks who don't like the idea of Swish3 being so Perl-centric. To that I can say only, tough luck. :)

Seriously though, my perspective is that there will be multiple Swish3 implementations. The one I am working on is in Perl using SWISH::Prog. There's nothing to stop someone from implementing one in C or C++ or Java or whatever. libswish3 provides the parsing/tokenizing piece missing from other IR projects, and it is a library for the very reason that implementing a Swish3 program should be language-neutral. If you can link against a C library, then you can write a Swish3 program. The header file API is well documented; the backend is supposed to be pluggable. It's all about the API.

I do intend to write a swish_xapian.cpp program eventually, showing how to implement a C++ Swish3 program with Xapian. That could be the fallback program if you really don't want to use Perl.


Documentation
I've started a swish_intro.7 and swish_migration.7 set of docs. swish_intro will outline the aggregator/normalizer/analyzer/indexer/searcher philosophy and the outline of the libswish3 API. swish_migration will discuss differences in Swish2 vs Swish3 and how you can convert your config files and move to using Swish3.

File under projects/swish Sat Mar 15 22:23:14 CT 2008

HTTP Flow

This flowchart URL was posted today on #catalyst.

File under projects/ Tue Mar 11 09:16:26 CT 2008

Bug? Or Feature?

You decide.

File under projects/ Mon Mar 10 11:17:37 CT 2008

Why I sleep so little

File under projects/ Wed Feb 20 09:39:43 CT 2008

Frozen Perl

I'm at Frozen Perl this morning, finishing my search talk slides. This entry is to point interested folks at the svn repos where the slides and examples for both my talks reside.

Search and Perl: Where are We?

Unravelling the Spaghetti: Perl Best Practices and Existing Code

File under projects/ Sat Feb 16 10:51:22 CT 2008

Perl Modules I cannot live without

As I started to use CPAN, one of the hardest things to learn was, of the many modules that claim to implement the same (or similar) feature set, which is considered the best (stable, actively supported, secure) module for the task. It has taken me lots of trial and error, and I am still gleaning best practices every day.

In no particular order, here are some of the CPAN modules I consider essential, by which I mean that nearly every script or project I develop uses many from this list. (NOTE that I do not include any of my own CPAN modules here.)
  • Carp (part of the standard Perl install)
  • Path::Class
  • Class::Accessor::Fast (or Object::Tiny)
  • File::Slurp
  • Data::Dump
  • Template::Toolkit
  • LWP
  • XML::LibXML
  • Rose::DB::Object (and DBI of course)
  • DateTime
  • Getopt::Long (for cli scripts)
  • Pod::Usage (ditto)

File under projects/ Fri Jan 18 12:02:26 CT 2008

OE on Facebook

Luddites beware!

File under general/ Fri Jan 18 08:25:05 CT 2008

Zero Inbox

Hi, I'm Peter and I have a problem with my Inbox.

File under general/ Thu Jan 17 11:12:24 CT 2008

Swift Kids for Truth

Nice.

File under general/ Thu Jan 17 11:11:12 CT 2008

Long Live Perl

Ran across this essay while googling for 'perl modules I cannot live without.'

File under projects/ Mon Jan 7 08:40:19 CT 2008

