Projects

In this area you'll find other things I'm working on and/or interested in.

But have I mentioned my beautiful son lately?

Swish3 Status 19 April 2008

There's been quite a bit of activity in the last month.

  • The C++ Xapian example can now search as well as index, and Perl equivalents using Search::Xapian are checked into svn too. The C++ code will read/write the swish.xml header; the Perl does not (yet).
  • The meta/prop id unique check now uses a hash for quick look up.
  • The test suite for libswish3 has been totally restructured. It now uses Perl's Test::Harness and includes a slew of new meta/prop tests. Alongside that, the NamedBuffer debugging output now prints each substring in the buffer.
  • Added several new string-related utility functions for converting ints to strings and back, plus a new config hash for configuration options that use a StringList instead of a simple string.
  • Fixed some mem leaks in the example .c programs and added more info to the swish_lint usage() output (including reminders about the various SWISH_DEBUG* env var values).
There are still several parser features yet to be implemented to support the Swish-e 2.4 config options, but those will likely take a backseat to getting a working swish3 Perl script running with SWISH::Prog and SWISH::Prog::Xapian.

Introduction to Swish3

Swish is the Simple Web Indexing System for Humans. Swish is an information retrieval tool. It is not a search engine, but can be used as an integral part of creating a search engine. Swish gathers, parses, indexes and searches document collections. A collection can be any set of real or virtual documents: web pages, database rows, PDFs or office files, or anything else that can be converted to text.

Swish3 is version three of Swish. Kevin Hughes wrote the original version in 1994. In 2000, the project was updated and released as Swish-e version 2 (the -e is for Enhanced). Swish3 is the third phase in the evolution of the project.

In this document, the name Swish will refer to the entire project, without regard to a particular version. Swish-e will refer specifically to version 2.x. Swish3 will refer specifically to version three.

Anatomy of a Search Tool Chain

The following description could apply to any search system or information retrieval project, not just Swish. First we'll look at the various parts of the system, then look at how they are implemented in Swish3.

Every search system implements the following chain of features:

  • aggregator

    An aggregator assembles documents into a collection. It can be as simple as a filesystem tool like the Unix find command or as sophisticated as a web crawler. An aggregator selects documents based on various criteria: content, MIME type or format, date, author, URL, or any other criteria that you desire.

  • normalizer

    A normalizer verifies that all documents the aggregator collects are in a format that the analyzer can parse. For example, a binary file format like PDF is converted to HTML or unmarked text. The same is true for all office file formats, PostScript, etc.

  • analyzer

    An analyzer examines the text supplied via the aggregator/normalizer steps. The analyzer does several things, some of them optional:

    • parsing

      Separates text from any surrounding markup, optionally remembering the context (tag) in which text was found.

    • case folding

      Changes the text to all lowercase or all uppercase, to make comparisons easier.

    • tokenizing

      Splits a string of text into tokens or words.

    • stemming

      Tries to discover the root stem of each word, using one of a variety of word-stemming algorithms.

    • customization

      Many advanced analyzers offer some level of customization to apply at some point in the analysis, whether it be synonym matching or other linguistic logic.

  • indexer

    An indexer stores basic document metadata and token (word) information in an index for fast and efficient retrieval.

  • searcher

    A searcher parses a user query using the same logic used by the analyzer when processing the original document collection, applies some well-defined rules for matching documents in the index, and then returns results, typically a list or iterator of matching documents.
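
To make the chain concrete, here is a minimal Python sketch of the analyzer, indexer, and searcher links. It is a toy under stated assumptions, not a description of how any version of Swish works: the function names (parse, stem, analyze, build_index, search), the regex-based tag stripping, and the crude suffix-stripping stemmer are all invented for illustration.

```python
import re
from collections import defaultdict

def parse(markup):
    """Parsing: separate text from markup (naive tag stripping, illustration only)."""
    return re.sub(r"<[^>]+>", " ", markup)

def stem(token):
    """Stemming: a crude suffix stripper standing in for a real algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(markup):
    """Run parsing, case folding, tokenizing, and stemming in order."""
    text = parse(markup).lower()          # case folding
    tokens = re.findall(r"\w+", text)     # tokenizing
    return [stem(t) for t in tokens]

def build_index(docs):
    """Indexer: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, markup in docs.items():
        for token in analyze(markup):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Searcher: analyze the query like a document, then AND the token matches."""
    tokens = analyze(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

# Two tiny documents, as an aggregator/normalizer might hand them over.
docs = {
    "a.html": "<html><body><h1>Indexing Documents</h1></body></html>",
    "b.html": "<p>Searching an index</p>",
}
index = build_index(docs)
```

Note that search() runs the query through the same analyze() function used at index time, which is why a query for "Indexes" can match a document that contains "Indexing".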

Now let's look at how Swish3 implements these five features.

A Library, Not a Command

The first thing to know about Swish3 is that, unlike previous versions of Swish, there is not a single Swish3 implementation.

That might sound confusing at first, because it is a significant departure from earlier versions of Swish, where there was a primary program, written in C, which handled all five links in the search chain. Swish3 takes a different approach.

Swish3 is primarily a C library called libswish3. The library has a well-defined list of public functions and data structures that aim to fill a particular void in the world of information retrieval tools: analyzing HTML and XML documents.

Swish3 takes as its starting point the -S prog feature of Swish-e, where you can define your own aggregator/normalizer program, and makes that Swish3's central feature. Swish3 extends the -S prog API to include additional header values, and adds the same MIME-type-matching feature as the popular Apache web server.
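
Extension-based MIME matching of the kind Apache performs with its mime.types table can be sketched with Python's standard mimetypes module. This only illustrates the idea; it does not show how libswish3 does its lookup, and the pick_parser helper and its parser labels are hypothetical.

```python
import mimetypes

# Hypothetical mapping from MIME type to a parser label; not taken from libswish3.
PARSERS = {
    "text/html": "HTML",
    "application/xml": "XML",
    "text/xml": "XML",
    "text/plain": "TXT",
}

def pick_parser(path, default="TXT"):
    """Guess the MIME type from the file extension and choose a parser label.

    Returns a (mime_type, parser_label) tuple; mime_type is None when the
    extension is unknown, in which case the default label is used.
    """
    mime, _encoding = mimetypes.guess_type(path)
    return mime, PARSERS.get(mime, default)
```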

Swish3 has no native indexer or searcher features [TODO: this might change if the 2.6 BDB backend is ported]. Nor does it have any aggregator or normalizer features. Swish3 is primarily an analyzer.

The Swish3 distribution does come with some examples of how to write Swish3 applications, including an example program for using the popular Xapian library. And there is a Perl implementation based on the SWISH::Prog package.

So How Does It Work?

libswish3 defines hooks or callbacks where you can override the default behaviour of the analyzer. These hooks are intended for making it easy to plug libswish3 into the indexing chain.

Here's one example. If you wanted to index a web site, you might use an aggregator/normalizer tool like Swish-e's spider.pl, which prints its output on stdout.

 % spider.pl your_config > spider_output

Then you could use a program like swish_xapian to analyze and index the output:

 % swish_xapian -c swish.conf - < spider_output

If you look at the source for the swish_xapian program, in the libswish3 distribution, you will see that there is a handler function defined that takes the output of the libswish3 parsing function and adds it to a Xapian index.
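
The handler pattern itself can be sketched in a few lines of Python. This is not the libswish3 or Xapian API: parse_document, MemoryIndex, and the shape of the parsed data are all hypothetical stand-ins that only show how a parsing function can hand its result to a caller-supplied function.

```python
import re

def parse_document(url, text, handler):
    """Stand-in for the analyzer: tokenize one document, then call the handler.

    The handler receives the parsed result; what it does with it (for example,
    adding it to an index) is entirely up to the caller.
    """
    parsed = {"url": url, "tokens": re.findall(r"\w+", text.lower())}
    handler(parsed)

class MemoryIndex:
    """Toy backend standing in for a real indexing library."""

    def __init__(self):
        self.postings = {}

    def add(self, parsed):
        """Handler: record which documents contain each token."""
        for token in parsed["tokens"]:
            self.postings.setdefault(token, set()).add(parsed["url"])

backend = MemoryIndex()
parse_document("http://example.com/", "Hello Swish3 world", backend.add)
parse_document("http://example.com/about", "Hello again", backend.add)
```

Swapping in a different backend means passing a different handler; the parsing side does not change.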

See Also

This document provides an overview of Swish3's anatomy. You might also be interested in these docs:

Migrating from Swish-e to Swish3

If you haven't already, read the Introduction to Swish3 document first.

This document is intended for users already familiar with Swish-e version 2.x who want to migrate to using Swish3.

The Tool Chain

Swish3 is intended to be one part of a search system tool chain. In this section we will look at how Swish-e implements each of the tool chain features, and then compare it to Swish3.

Aggregator

Swish-e has two built-in aggregators, for filesystem and web, indicated with the -S flag to the swish-e command. Swish-e also has a third -S option called prog, which is short for program. The program is an aggregator that you define. Swish-e ships with several example aggregators, including a filesystem crawler called DirTree.pl and a web crawler called spider.pl. There are also example aggregators for pulling data from a database and for specific kinds of documents, like Hypermail mail archives.

Swish3 has no built-in aggregators. Instead, Swish3 takes the -S prog approach of defining an API for external aggregators to follow.
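
In Swish-e, a -S prog aggregator writes each document to stdout as a short block of headers (Path-Name and Content-Length, plus optional ones such as Last-Mtime), then a blank line, then the document content. The Python sketch below produces records in that shape. The helper names format_record and emit are hypothetical, and the additional header values that Swish3 defines are left out because they are not listed here.

```python
import sys

def format_record(path, content, mtime=None):
    """Build one -S prog style record: headers, a blank line, then the content.

    Content-Length is the size of the content in bytes, not characters.
    """
    body = content.encode("utf-8")
    headers = [f"Path-Name: {path}", f"Content-Length: {len(body)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body

def emit(records, out=None):
    """Write (path, content) pairs as records to a binary stream (default: stdout)."""
    if out is None:
        out = sys.stdout.buffer
    for path, content in records:
        out.write(format_record(path, content))

record = format_record("docs/hello.html", "<p>héllo</p>", mtime=1208563200)
```

Writing to the binary stream keeps the byte count in Content-Length consistent with what the reader on the other end of the pipe actually receives.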

Normalizer

Swish-e has a feature called FileFilter which allows you to define an external program to call if a document's name matches a particular pattern. The file is handed to the external program and the output of the external program is treated as the contents of the document. For example, you can specify that all documents that end with .pdf are first filtered through the pdftotext command.
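
The idea behind FileFilter can be sketched in Python as a table of filename patterns and external commands. This is an illustration only, not Swish-e's code or its configuration syntax: find_filter, normalize, and the FILTERS table are hypothetical, and "%p" is used here simply as this sketch's placeholder for the file path.

```python
import fnmatch
import subprocess

def find_filter(filename, filters):
    """Return the command template for the first pattern matching filename, or None."""
    for pattern, command in filters:
        if fnmatch.fnmatch(filename, pattern):
            return command
    return None

def normalize(filename, filters):
    """Run the matching external filter and return its stdout as the document text.

    Files with no matching filter are read as-is.
    """
    command = find_filter(filename, filters)
    if command is None:
        with open(filename, encoding="utf-8") as fh:
            return fh.read()
    # Substitute the file path into the command template.
    argv = [arg.replace("%p", filename) for arg in command]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical filter table. pdftotext writes to stdout when the output
# file is given as "-".
FILTERS = [("*.pdf", ["pdftotext", "%p", "-"])]
```

Calling normalize("report.pdf", FILTERS) would run pdftotext on the file, assuming pdftotext is installed, and treat its output as the document's contents.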

Swish-e also comes with a set of Perl modules bundled together as SWISH::Filter. SWISH::Filter is used by the external aggregators like DirTree.pl and spider.pl, thus making those programs both aggregators and normalizers.

Swish3 has no built-in normalizer or feature like FileFilter. Instead, Swish3 assumes that something like SWISH::Filter will be used to standardize documents before they are handed to Swish3.

Configuration

One of the biggest changes is the configuration file format. Swish3 uses XML-style configuration files, and supports a subset of the configuration options available in Swish-e.

This section documents the configuration options supported in Swish3.

See Also
Ruby on Rails

I do not use it myself, preferring Perl+Catalyst. But this is an interesting perspective on deployment issues at a major shared hosting provider and as such, is not limited to RoR.

Swish3 Status 30 March 2008

More progress with Swish3.

  • There is now a swish_xapian.cpp C++ example for using libswish3 with a Xapian backend. All that is complete is the indexing portion; still TODO is the search part. Still, it is significant that it was so easy to build a search engine.
  • The swish.xml header format is complete, and the code can now read/write the header file. I still need to add that part to the swish_xapian.cpp example.
  • Squashed some long-standing memory leaks when using the filehandle functions.
Little by little.

Swish3 Status 2008-03-15

I've finally gotten back to Swish3 development after several months away. Hard to believe I've been working on this project for something close to 3 years now.

Lately I have been focusing on the following things:

Header file format
Because Swish3 will have multiple IR backends, it is important that there be a consistent index metadata file that describes the MetaNames, Properties, and tokenizing information, just like the Swish2 header does. Just as with the config file format, it makes sense to define the header file format as XML, since we already have a robust XML parser for free. To make it simple, I have defined the header file XML schema to be the same as the config file schema. In short, you configure Swish3 by creating a header file. The "real" header file will be more strict about explicitly naming all the expected attribute values, numbering the MetaName/Property ids, etc. But the idea is simple: a single XML schema.

I have written the code to read the header/config file format and create a swish_Config object. There's also code for merging two swish_Config objects together, so that you can define a config file to override an existing header file.
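
The override idea can be pictured with plain dictionaries in Python. This is a sketch under assumptions, not the swish_Config merge code: the merge_config function, the nested-dictionary layout, and the setting names are all invented for illustration.

```python
import copy

def merge_config(header, override):
    """Return a new config where values from override replace those in header.

    Nested dictionaries are merged key by key; any other value is replaced whole.
    Neither input is modified.
    """
    merged = copy.deepcopy(header)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Hypothetical settings, named only for illustration.
header = {"MetaNames": {"title": {"id": 1}, "author": {"id": 2}}, "tokenize": "yes"}
override = {"MetaNames": {"author": {"id": 2, "bias": 5}}, "tokenize": "no"}
merged = merge_config(header, override)
```

Settings that appear only in the header survive the merge, and the original header is left untouched.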

Still TODO is the code for writing the header file.

The Perl bindings have been updated to reflect the new swish_Config API. This required a great deal of reworking and thinking about the Perl API. I had to rewrite things a few times to get a workable solution. The key Perl mantra is "objects on demand." I.e., do not define any Perl objects that wrap C pointers and try to keep them on the XS side. Instead, create all Perl objects "just in time" as part of the get_* method call. This makes reference counting much simpler.

MetaNames and PropertyNames
These now have their own C API with swish_MetaName and swish_Property structs. These relate directly to the header file format and swish_Config. There will end up being a separate PropertyName API for search results. I still think we're going to have to port the Swish2 PropertyNames storage/retrieval code to Swish3 and have a backend-independent index.prop file. The issue with this is going to be scaling. One other thought I've had is storing properties in a SQLite db. That route won't allow for presorted properties, but does have the advantage of being much more transparent and debuggable.
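
As a rough picture of the SQLite idea, here is a Python sketch using the standard sqlite3 module. It is not a design that exists in Swish3: the table layout and the store_properties and sorted_doc_ids helpers are assumptions made for illustration. It shows the trade-off mentioned above, since sorting happens at query time rather than being precomputed, while the data stays inspectable with any SQLite client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE properties ("
    " doc_id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (doc_id, name))"
)

def store_properties(conn, doc_id, props):
    """Insert or replace every property for one document."""
    conn.executemany(
        "INSERT OR REPLACE INTO properties (doc_id, name, value) VALUES (?, ?, ?)",
        [(doc_id, name, value) for name, value in props.items()],
    )

def sorted_doc_ids(conn, doc_ids, sort_by):
    """Sort a result set by one property at query time (nothing is presorted)."""
    if not doc_ids:
        return []
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        f"SELECT doc_id FROM properties WHERE name = ? AND doc_id IN ({placeholders})"
        " ORDER BY value, doc_id",
        [sort_by, *doc_ids],
    )
    return [row[0] for row in rows]

store_properties(conn, 1, {"swishtitle": "Zebra", "swishdocpath": "/z.html"})
store_properties(conn, 2, {"swishtitle": "Apple", "swishdocpath": "/a.html"})
```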

SWISH::Prog
I have moved the SWISH::Prog svn tree to svn.swish-e.org from peknet.com. I also moved SWISH::Filter (and likely will eventually move SWISH::API::More and its cousins).

SWISH::Prog will form the framework for the Perl implementation of Swish3. I know there are some folks who don't like the idea of Swish3 being so Perl-centric. To that I can say only, tough luck. :)

Seriously though, my perspective is that there will be multiple Swish3 implementations. The one I am working on is in Perl using SWISH::Prog. There's nothing to stop someone from implementing one in C or C++ or Java or whatever. libswish3 provides the parsing/tokenizing piece missing from other IR projects, and it is a library for the very reason that implementing a Swish3 program should be language-neutral. If you can link against a C library, then you can write a Swish3 program. The header file API is well documented; the backend is supposed to be pluggable. It's all about the API.

I do intend to write a swish_xapian.cpp program eventually, showing how to implement a C++ Swish3 program with Xapian. That could be the fallback program if you really don't want to use Perl.

Documentation
I've started a swish_intro.7 and swish_migration.7 set of docs. swish_intro will outline the aggregator/normalizer/analyzer/indexer/searcher philosophy and the outline of the libswish3 API. swish_migration will discuss differences in Swish2 vs Swish3 and how you can convert your config files and move to using Swish3.

Why I sleep so little