We've made the joke many times, but the resemblance in this montage is uncanny.

There's been quite a bit of activity in the last month.
- The C++ Xapian example now can search as well as index, and there are Perl equivalents using Search::Xapian checked into svn as well. The C++ code will read/write the swish.xml header; the Perl does not (yet).
- The meta/prop id unique check now uses a hash for quick look up.
- The test suite for libswish3 has been completely restructured. It now uses Perl's Test::Harness, and I added a slew of new meta/prop tests. Alongside that, the NamedBuffer debugging output now prints each substring in the buffer.
- Several new string-related utility functions for converting ints to strings and back. Also a new config hash for configuration options that use a StringList instead of a simple string.
- Fixed some mem leaks in the example .c programs and added more info to the swish_lint usage() output (including reminders about the various SWISH_DEBUG* env var values).
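The hash lookup behind the meta/prop id uniqueness check above can be sketched roughly as follows. This is a Python illustration of the idea only, not the libswish3 C code; the function name `find_duplicate_ids` and the `(name, id)` input shape are assumptions made for the example.

```python
def find_duplicate_ids(fields):
    """Return (id, first_name, clashing_name) for every id used more than once.

    `fields` is an iterable of (name, id) pairs, e.g. MetaNames or Properties.
    A dict gives average constant-time lookup per id, instead of rescanning
    a list for every new entry.
    """
    seen = {}      # id -> name that first claimed it
    clashes = []
    for name, field_id in fields:
        if field_id in seen:
            clashes.append((field_id, seen[field_id], name))
        else:
            seen[field_id] = name
    return clashes
```

Calling it with `[("title", 1), ("author", 2), ("body", 1)]` reports that `body` reuses the id first claimed by `title`.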
Swish is the Simple Web Indexing System for Humans. Swish is an information retrieval tool. It is not a search engine, but can be used as an integral part of creating a search engine. Swish gathers, parses, indexes and searches document collections. A collection can be any set of real or virtual documents: web pages, database rows, PDFs or office files, or anything else that can be converted to text.
Swish3 is version three of Swish. Kevin Hughes wrote the original version in 1994. In 2000, the project was updated and released as Swish-e version 2 (the -e is for Enhanced). Swish3 is the third phase in the evolution of the project.
In this document, the name Swish will refer to the entire project, without regard to a particular version. Swish-e will refer specifically to version 2.x. Swish3 will refer specifically to version three.
The following description could apply to any search system or information retrieval project, not just Swish. First we'll look at the various parts of the system, then look at how they are implemented in Swish3.
Every search system implements the following chain of features:
- aggregator
An aggregator assembles documents into a collection. It can be as simple as a filesystem tool like the Unix find command or as sophisticated as a web crawler. An aggregator selects documents based on various criteria: content, MIME type or format, date, author, URL, or any other criteria that you desire.
- normalizer
A normalizer ensures that all documents the aggregator collects are in a format that the analyzer can parse. For example, a binary file format like PDF is converted to HTML or unmarked text. The same is true for all office file formats, PostScript, etc.
- analyzer
An analyzer examines the text supplied via the aggregator/normalizer steps. The analyzer does several things, some of them optional:
- parsing
Separates text from any surrounding markup, optionally remembering the context (tag) in which text was found.
- case folding
Changes the text to all lowercase or all uppercase, to make comparisons easier.
- tokenizing
Splitting a string of text into tokens or words.
- stemming
Using one of a variety of word-stemming algorithms, tries to discover the root stem of each word.
- customization
Many advanced analyzers offer some level of customization to apply at some point in the analysis, whether it be synonym matching or other linguistic logic.
- indexer
An indexer stores basic document metadata and token (word) information in an index for fast and efficient retrieval.
- searcher
A searcher parses a user query using the same logic used by the analyzer when processing the original document collection, applies some well-defined rules for matching documents in the index, and then returns results, typically a list or iterator of matching documents.
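To make the chain above concrete, here is a toy sketch of the analyzer, indexer and searcher links in Python. It is not how Swish3 implements any of them: the regex-based markup stripping, the three-suffix "stemmer" and every function name are simplifications assumed for illustration. The point it demonstrates is that the searcher runs the query through the same `analyze()` function used at index time.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")   # crude markup stripper, not a real HTML parser
TOKEN_RE = re.compile(r"\w+")

def stem(token):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Analyzer: parse (strip tags), case fold, tokenize, stem."""
    plain = TAG_RE.sub(" ", text)
    folded = plain.lower()
    return [stem(tok) for tok in TOKEN_RE.findall(folded)]

def build_index(docs):
    """Indexer: map each stemmed token to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Searcher: analyze the query, then AND together the matching doc sets."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    "a.html": "<h1>Indexing Documents</h1><p>Swish indexes web pages.</p>",
    "b.html": "<p>Searching is the last step.</p>",
}
index = build_index(docs)
```

Because both "Indexing" and "indexes" reduce to the same stem, `search(index, "Indexes")` finds `a.html` even though the query matches neither word exactly as written.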
Now let's look at how Swish3 implements these five features.
The first thing to know about Swish3 is that, unlike previous versions of Swish, there is not a single Swish3 implementation.
That might sound confusing at first, because it is a significant departure from earlier versions of Swish, where there was a primary program, written in C, which handled all five links in the search chain. Swish3 takes a different approach.
Swish3 is primarily a C library called libswish3. The library has a well-defined list of public functions and data structures that aim to fill a particular void in the world of information retrieval tools: analyzing HTML and XML documents.
Swish3 takes as its starting point the -S prog feature of Swish-e, where you can define your own aggregator/normalizer program, and makes that Swish3's central feature. Swish3 extends the -S prog API to include additional header values, and adds the same MIME-type-matching feature as the popular Apache web server.
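MIME-type matching of the kind Apache does with its mime.types table boils down to a lookup from file extension to type. The sketch below shows that idea in Python; the table contents, the `text/plain` fallback and the name `mime_for` are assumptions for illustration and say nothing about how libswish3 stores or resolves its own mappings.

```python
import os

# A tiny, hypothetical extension table in the spirit of Apache's mime.types file.
MIME_TYPES = {
    "html": "text/html",
    "htm": "text/html",
    "xml": "application/xml",
    "txt": "text/plain",
    "pdf": "application/pdf",
}
DEFAULT_MIME = "text/plain"   # assumed fallback for unknown extensions

def mime_for(path):
    """Return the MIME type for `path`, based on its last extension only."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return MIME_TYPES.get(ext, DEFAULT_MIME)
```

An analyzer can then pick a parser from the returned type rather than from the file name itself.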
Swish3 has no native indexer or searcher features [TODO: this might change if the 2.6 BDB backend is ported]. Nor does it have any aggregator or normalizer features. Swish3 is primarily an analyzer.
The Swish3 distribution does come with some examples of how to write Swish3 applications, including an example program for using the popular Xapian library. And there is a Perl implementation based on the SWISH::Prog package.
libswish3 defines hooks or callbacks where you can override the default behaviour of the analyzer. These hooks are intended for making it easy to plug libswish3 into the indexing chain.
Here's one example. If you wanted to index a web site, you might use an aggregator/normalizer tool like Swish-e's spider.pl. spider.pl will print its output on stdout.
% spider.pl your_config > spider_output
Then you could use a program like swish_xapian to analyze and index the output:
% swish_xapian -c swish.conf - < spider_output
If you look at the source for the swish_xapian program, in the libswish3 distribution, you will see that there is a handler function defined that takes the output of the libswish3 parsing function and adds it to a Xapian index.
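The handler pattern described here can be sketched in Python as follows. None of these names come from libswish3 or Xapian: `parse_document` stands in for a parsing function that calls back once per document, and `MemoryIndex` is a toy in-memory backend playing the role that Xapian plays in swish_xapian.

```python
def parse_document(url, text, handler):
    """Stand-in for a parser that calls `handler` once per parsed document.

    The handler receives a plain dict here; a real libswish3 handler receives
    C structures whose layout is not shown in this sketch.
    """
    parsed = {"url": url, "tokens": text.lower().split()}
    handler(parsed)

class MemoryIndex:
    """Toy backend: keeps a token -> list-of-urls mapping in memory."""

    def __init__(self):
        self.postings = {}

    def add(self, parsed):
        # Record each distinct token once per document.
        for token in set(parsed["tokens"]):
            self.postings.setdefault(token, []).append(parsed["url"])

index = MemoryIndex()
parse_document("http://example.com/", "Hello Swish world", index.add)
```

Swapping the backend means passing a different handler; the parsing side does not change.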
This document provides an overview of Swish3's anatomy. You might also be interested in these docs:
If you haven't already, read the Introduction to Swish3 document first.
This document is intended for users already familiar with Swish-e version 2.x who want to migrate to using Swish3.
Swish3 is intended to be one part of a search system tool chain. In this section we will look at how Swish-e implements each of the tool chain features, and then compare it to Swish3.
Swish-e has two built-in aggregators, for filesystem and web, indicated with the -S flag to the swish-e command. Swish-e also has a third -S option called prog, which is short for program. The program is an aggregator that you define. Swish-e ships with several example aggregators, including a filesystem crawler called DirTree.pl and a web crawler called spider.pl. There are also example aggregators for pulling data from a database and for specific kinds of documents, like Hypermail mail archives.
Swish3 has no built-in aggregators. Instead, Swish3 takes the -S prog approach of defining an API for external aggregators to follow.
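As a rough illustration of what an external aggregator produces, the Python sketch below formats one document the way Swish-e 2.x -S prog programs do: a few header lines, a blank line, then the document body. The header names shown (Path-Name, Content-Length, Last-Mtime) are the Swish-e 2.x ones; the additional headers that Swish3 defines are not shown here, and the function name `format_doc` is an assumption for the example.

```python
def format_doc(path, content, mtime=None):
    """Return one document as bytes: headers, a blank line, then the body.

    Content-Length counts bytes of the encoded body, not characters.
    """
    body = content.encode("utf-8")
    headers = [f"Path-Name: {path}", f"Content-Length: {len(body)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body
```

An aggregator would write the result of `format_doc()` for each document to stdout (for example with `sys.stdout.buffer.write`), and the indexing program reads that stream from a pipe or file.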
Swish-e has a feature called FileFilter which allows you to define an external program to call if a document's name matches a particular pattern. The file is handed to the external program and the output of the external program is treated as the contents of the document. For example, you can specify that all documents that end with .pdf are first filtered through the pdftotext command.
Swish-e also comes with a set of Perl modules bundled together as SWISH::Filter. SWISH::Filter is used by the external aggregators like DirTree.pl and spider.pl, thus making those programs both aggregators and normalizers.
Swish3 has no built-in normalizer or feature like FileFilter. Instead, Swish3 assumes that something like SWISH::Filter will be used to standardize documents before they are handed to Swish3.
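The FileFilter idea (match a file name against a pattern, run an external program, treat its output as the document) can be sketched in Python like this. The `FILTERS` table and both function names are hypothetical and this is not Swish-e's configuration syntax; the only external fact relied on is that pdftotext writes to stdout when the output file is given as `-`.

```python
import fnmatch
import subprocess

# Hypothetical pattern -> command table; "{}" is replaced by the file path.
FILTERS = [
    ("*.pdf", ["pdftotext", "{}", "-"]),
]

def pick_filter(path):
    """Return the command for the first pattern matching `path`, or None."""
    for pattern, command in FILTERS:
        if fnmatch.fnmatch(path.lower(), pattern):
            return [path if arg == "{}" else arg for arg in command]
    return None

def normalize(path):
    """Return document text, running an external filter when one matches."""
    command = pick_filter(path)
    if command is None:
        # No filter configured: read the file as-is.
        with open(path, encoding="utf-8", errors="replace") as fh:
            return fh.read()
    done = subprocess.run(command, capture_output=True, check=True)
    return done.stdout.decode("utf-8", errors="replace")
```

Keeping the selection step (`pick_filter`) separate from the step that runs the command makes the pattern table easy to check without having pdftotext installed.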
One of the biggest changes is the configuration file format. Swish3 uses XML-style configuration files, and supports a subset of the configuration options available in Swish-e.
This section documents the configuration options supported in Swish3.
I do not use it myself, preferring Perl+Catalyst. But this is an interesting perspective on deployment issues at a major shared hosting provider and as such, is not limited to RoR.
More progress with Swish3.
- There is now a swish_xapian.cpp C++ example for using libswish3 with a Xapian backend. Only the indexing portion is complete; the search part is still TODO. Still, it is significant that it was so easy to build a search engine.
- The swish.xml header format is complete, and the code can now read/write the header file. I still need to add that part to the swish_xapian.cpp example.
- Squashed some long-standing memory leaks when using the filehandle functions.
I've finally gotten back to Swish3 development after several months away. Hard to believe I've been working on this project for something close to 3 years now.
Lately I have been focusing on the following things:
- Header file format
- Because Swish3 will have multiple IR backends, it is important that there be a consistent index metadata file that describes the MetaNames, Properties, and tokenizing information, just like the Swish2 header does. Just as with the config file format, it makes sense to define the header file format as XML, since we already have a robust XML parser for free. To make it simple, I have defined the header file XML schema to be the same as the config file schema. In short, you configure Swish3 by creating a header file. The "real" header file will be more strict about explicitly naming all the expected attribute values, numbering the MetaName/Property ids, etc. But the idea is simple: a single XML schema.
I have written the code to read header/config file format and create a swish_Config object. There's also code for merging 2 swish_Config objects together, so that you can define a config file to override an existing header file.
Still TODO is the code for writing the header file.
The Perl bindings have been updated to reflect the new swish_Config API. This required a great deal of reworking and thinking about the Perl API. I had to rewrite things a few times to get a workable solution. The key Perl mantra is "objects on demand." I.e., do not define any Perl objects that wrap C pointers and try to keep them on the XS side. Instead, create all Perl objects "just in time" as part of the get_* method call. This makes reference counting much simpler.
- MetaNames and PropertyNames
- These now have their own C API with swish_MetaName and swish_Property structs. These relate directly to the header file format and swish_Config. There will end up being a separate PropertyName API for search results. I still think we're going to have to port the Swish2 PropertyNames storage/retrieval code to Swish3 and have a backend-independent index.prop file. The issue with this is going to be scaling. One other thought I've had is storing properties in a SQLite db. That route won't allow for presorted properties, but does have the advantage of being much more transparent and debuggable.
- SWISH::Prog
- I have moved the SWISH::Prog svn tree to svn.swish-e.org from peknet.com. I also moved SWISH::Filter (and likely will eventually move SWISH::API::More and its cousins).
SWISH::Prog will form the framework for the Perl implementation of Swish3. I know there are some folks who don't like the idea of Swish3 being so Perl-centric. To that I can say only, tough luck. :)
Seriously though, my perspective is that there will be multiple Swish3 implementations. The one I am working on is in Perl using SWISH::Prog. There's nothing to stop someone from implementing one in C or C++ or Java or whatever. libswish3 provides the parsing/tokenizing piece missing from other IR projects, and it is a library for the very reason that implementing a Swish3 program should be language-neutral. If you can link against a C library, then you can write a Swish3 program. The header file API is well documented; the backend is supposed to be pluggable. It's all about the API.
I do intend to write a swish_xapian.cpp program eventually, showing how to implement a C++ Swish3 program with Xapian. That could be the fallback program if you really don't want to use Perl.
- Documentation
- I've started a swish_intro.7 and swish_migration.7 set of docs. swish_intro will outline the aggregator/normalizer/analyzer/indexer/searcher philosophy and the outline of the libswish3 API. swish_migration will discuss the differences between Swish2 and Swish3 and how you can convert your config files and move to using Swish3.
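The config-over-header merge described under "Header file format" above can be illustrated with a small Python sketch. It is not the swish_Config C API. The XML element and attribute names (`swish`, `MetaNames`, `title`, `id`, `bias`) are assumptions chosen for the example, as are the function names; the only point is that a header and a config share one shape, so a config can override values in an existing header.

```python
import xml.etree.ElementTree as ET

def load_config(xml_text):
    """Flatten <section><name attr="..."/></section> into {section: {name: attrs}}."""
    root = ET.fromstring(xml_text)
    return {
        section.tag: {item.tag: dict(item.attrib) for item in section}
        for section in root
    }

def merge_config(base, override):
    """Return a new config in which `override` wins on any conflicting attribute."""
    # Copy every level so that neither input is modified.
    merged = {sec: {name: dict(attrs) for name, attrs in items.items()}
              for sec, items in base.items()}
    for sec, items in override.items():
        for name, attrs in items.items():
            merged.setdefault(sec, {}).setdefault(name, {}).update(attrs)
    return merged

# Hypothetical header and config documents sharing one schema.
header = load_config(
    '<swish><MetaNames><title id="1" bias="1"/><body id="2"/></MetaNames></swish>'
)
config = load_config(
    '<swish><MetaNames><title bias="10"/></MetaNames></swish>'
)
merged = merge_config(header, config)
```

Here the config changes only the `bias` of `title`; the ids assigned in the header survive the merge untouched.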