The State of Swish3

I've been working on the next version of Swish-e (codename: Swish3) for about a year now. Squirreling away hours in the late evening after the kid is asleep, with one ear on the TV my wife is watching in the other room, and eyes on the screen in here. I've been learning C, UTF-8 and Perl's XS glue language. It's been a very stretching year.

This little corner of the blog will record my progress.

Dezi search platform

This week I announced the initial release of Dezi, a new search platform based on Swish3, Apache Lucy, OpenSearch and Plack.

As of about 15 minutes ago, there are now PHP and Perl clients available.

libswish3 1.0.0 released

I am happy to announce the 1.0.0 release of libswish3:

http://swish-e.org/swish3/libswish3-1.0.0.tar.gz

libswish3 is at the core of multiple Swish3 implementations, and has reached a stable enough API that a 1.0.0 release seems appropriate.

From the README:

libswish3 is a document parser compatible with the Swish-e 2.4 -S prog API. libswish3 is a C library for parsing documents into a data structure that can then be stored and searched with a variety of IR backends.

There are currently four implementations of Swish3 available:

  • swish_xapian (C++ using libxapian, included in libswish3 distribution)
  • SWISH::Prog::Xapian (Perl using Search::Xapian)
  • SWISH::Prog::Lucy (Perl using Apache Lucy)
  • SWISH::Prog::KSx (Perl using KinoSearch)

All the Perl implementations are available from CPAN. They each rely on SWISH::3 (the Perl bindings to libswish3) and the core SWISH::Prog project, a Perl rewrite of the swish-e 2.x C binary and accompanying helper scripts. The SWISH::Prog distribution includes a 'swish3' command line interface with options very similar to the swish-e 2.x command line tool.

Xapian, KinoSearch and Apache Lucy all offer robust UTF-8 and incremental indexing support, as well as the ability to scale to many millions of documents across multiple servers.

You can read more about Swish3 at the devel site.

UPDATE: Mailing list announcement here.

Search::OpenSearch::Server with REST API

Just uploaded several modules to CPAN that together implement a full REST API for KinoSearch indexes, using Search::OpenSearch::Server::Plack.

% curl -XPOST http://localhost:5000/foo \
    -d '<doc><title>bar</title>foo</doc>' \
    -H 'Content-Type: application/xml'

[response:]
{
    "success":1,
    "doc":{
        "orgs":[],
        "places":[],
        "people":[],
        "topics":[],
        "summary":"",
        "title":"bar",
        "author":[]
    },
    "total":"21581",
    "code":"200"
}

The modules are:

  • Search::OpenSearch 0.11
  • Search::OpenSearch::Server 0.05
  • Search::OpenSearch::Engine::KSx 0.08
  • SWISH::Prog::KSx 0.17
  • SWISH::Prog 0.49

CPAN test failures

    SWISH::3 0.08_04 is passing all tests all over the CPAN testers universe, so that is encouraging.

    However, some testers (notably on FreeBSD) report false failures because of a Wstat issue.

    I've posted about it at PerlMonks and hope someone out there has an easy fix.

    Update: finally found a fix for this. The problem is that Perl has its own my_setenv() function that interferes with the native setenv() called by libswish3.c. The fix was to set the magic Perl var PL_use_safe_putenv as shown here. This took many hours of googling to track down. Glad to be done with it (I hope!).

    Swish3 progress report

    There's been a ton of work on Swish3 in the last year. I've actually started planning a 1.0 release, after 5 years of work.

    Lately I've been focusing on three things: (1) making the Perl bindings easier to install; (2) indexing of compressed documents; and (3) supporting XInclude of document fragments. The first is accomplished: you can install the entire library via CPAN. The last two are aimed at large doc sets where I want to keep the XML compressed on disk for space reasons, and where I want to re-use subsets of the document collections in building multiple indexes.

    Frozen Perl 2010

    It's been a long week, culminating today in Frozen Perl 2010, a Perl conference for and by Perl hackers, here in the Twin Cities. I gave two talks at today's conference, one on Swish3 and the other on Devel::NYTProf and Search::Tools. Both talks seemed well-received.

    In the process of preparing the talks I also released a few new, related modules to CPAN this week:

    Search::OpenSearch
    OpenSearch server glue for KinoSearch and Swish-e 2.x via SWISH::Prog. There's a demo app built with Plack and ExtJS that uses both search engines; it is part of the slides for my Swish3 talk.

    I think OpenSearch is very cool and look forward to doing more with that spec, including adding more features (e.g. facets) to Search::OpenSearch.

    Search::Query
    Search::Query now has support for SQL and SWISH Dialects. I hope to add KinoSearch and Xapian dialects soon. The Search::Query::Parser now has (undocumented and experimental) support for range queries, so that you can say:
    foo=( 1..4 )
    and that'll be expanded to
    foo=( 1 OR 2 OR 3 OR 4 )
    when the Dialect query object is stringified. Handy for things like ranges of dates, which is how I am using it at $work.
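    The expansion itself is simple enough to sketch in a few lines of C. This is only an illustration of the rewrite shown above, not how Search::Query implements it; the function name expand_range and its buffer-based interface are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Search::Query): expand an inclusive
 * integer range into "field=( a OR b OR ... )".
 * Returns 0 on success, -1 if lo > hi or the output buffer is too small.
 */
int expand_range(const char *field, int lo, int hi, char *out, size_t outsz)
{
    assert(field != NULL && out != NULL);
    if (lo > hi)
        return -1;

    /* Opening part: "foo=( " */
    int n = snprintf(out, outsz, "%s=( ", field);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    size_t used = (size_t)n;

    /* One term per integer, joined with " OR ". */
    for (int i = lo; ; i++) {
        n = snprintf(out + used, outsz - used, "%s%d",
                     (i == lo) ? "" : " OR ", i);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
        if (i == hi)        /* break before i++ so hi == INT_MAX is safe */
            break;
    }

    /* Closing part: " )" */
    n = snprintf(out + used, outsz - used, " )");
    if (n < 0 || (size_t)n >= outsz - used)
        return -1;
    return 0;
}
```

    Calling expand_range("foo", 1, 4, buf, sizeof buf) with a large enough buffer leaves the expanded form from the example above in buf.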
    Search::Tools, SWISH::API::*
    New releases of these older modules as well, with some bug fixes and refactoring to support the Search::Query.
    So, yes. A busy week.

    I enjoyed hearing other folks' talks today at Frozen Perl. There was a good variety: pack/unpack, Unicode, i18n and best practice-related presentations. I met some new people, renewed friendships with folks I already knew, and drank lots of free coffee. The cookies were good too.

    libswish3.3
    NAME

    libswish3 - Swish3 C library

    SYNOPSIS
    Data Structures
     
     struct swish_3
     {
         int             ref_cnt;
         void           *stash;
         swish_Config   *config;
         swish_Analyzer *analyzer;
         swish_Parser   *parser;
     };
     
     struct swish_StringList
     {
         unsigned int    n;
         unsigned int    max;
         xmlChar**       word;
     };
     
     
     struct swish_Config
     {
         int                          ref_cnt;
         void                        *stash;      /* for bindings */
         xmlHashTablePtr              misc;
         xmlHashTablePtr              properties;
         xmlHashTablePtr              metanames;
         xmlHashTablePtr              tag_aliases;
         xmlHashTablePtr              parsers;
         xmlHashTablePtr              mimes;
         xmlHashTablePtr              index;
         xmlHashTablePtr              stringlists;
         struct swish_ConfigFlags    *flags;      /* shortcuts for parsing */
     };
     
     struct swish_ConfigFlags
     {
         boolean         tokenize;
         boolean         cascade_meta_context;
         xmlHashTablePtr meta_ids;
         xmlHashTablePtr prop_ids;
         //xmlHashTablePtr contexts;
     };
     
     struct swish_NamedBuffer
     {
         int             ref_cnt;    /* for bindings */
         void           *stash;      /* for bindings */
         xmlHashTablePtr hash;       /* the meat */
     };
     
     struct swish_DocInfo
     {
         time_t              mtime;
         off_t               size;
         xmlChar *           mime;
         xmlChar *           encoding;
         xmlChar *           uri;
         unsigned int        nwords;
         xmlChar *           ext;
         xmlChar *           parser;
         xmlChar *           action;
         boolean             is_gzipped;
         int                 ref_cnt;
     };
     
     struct swish_MetaName
     {
         int                 ref_cnt;
         int                 id;
         xmlChar            *name;
         int                 bias;
         xmlChar            *alias_for;
     };
     
     struct swish_Property
     {
         int                 ref_cnt;
         int                 id;
         xmlChar            *name;
         boolean             ignore_case;
         int                 type;
         boolean             verbatim;
         xmlChar            *alias_for;
         unsigned int        max;
         boolean             sort;
         boolean             presort;
         unsigned int        sort_length;
     };
     
     struct swish_Token
     {
         unsigned int        pos;            // this token's position in document
         swish_MetaName     *meta;
         xmlChar            *value;
         xmlChar            *context;
         unsigned int        offset;
         unsigned int        len;
         int                 ref_cnt;
     };
     
     struct swish_TokenList
     {
         unsigned int        n;
         unsigned int        pos;            // track position in document
         xmlHashTablePtr     contexts;       // cache contexts
         xmlBufferPtr        buf;
         swish_Token**       tokens;
         int                 ref_cnt;
     };
     
     struct swish_TokenIterator
     {
         swish_TokenList     *tl;
         swish_Analyzer      *a;
         unsigned int         pos;           // position in iteration
         int                  ref_cnt;
     };
     
     struct swish_Tag
     {
         xmlChar            *raw;            // tag as libxml2 sees it
         xmlChar            *baked;          // tag as libswish3 sees it
         xmlChar            *context;
         struct swish_Tag   *next;
         unsigned int        n;
     };
     
     struct swish_TagStack
     {
         swish_Tag         *head;
         swish_Tag         *temp;
         unsigned int       count;
         char              *name;       // debugging aid -- name of the stack
     };
     
     struct swish_Analyzer
     {
         unsigned int           maxwordlen;         // max word length
         unsigned int           minwordlen;         // min word length
         boolean                tokenize;           // should we parse into TokenList
         int                  (*tokenizer) (swish_TokenIterator*, xmlChar*, swish_MetaName*, xmlChar*);
         xmlChar*             (*stemmer)   (xmlChar*);
         boolean                lc;                 // should tokens be lowercased
         void                  *stash;              // for script bindings
         void                  *regex;              // optional regex
         int                    ref_cnt;            // for script bindings
     };
     
     struct swish_Parser
     {
         int                    ref_cnt;             // for script bindings
         void                 (*handler)(swish_ParserData*); // handler reference
         void                  *stash;               // for script bindings
         int                    verbosity;           
     };
     
     struct swish_ParserData
     {
         swish_3               *s3;                 // main object
         xmlBufferPtr           meta_buf;           // tmp MetaName buffer
         xmlBufferPtr           prop_buf;           // tmp Property buffer
         xmlChar               *tag;                // current tag name
         swish_DocInfo         *docinfo;            // document-specific properties
         boolean                no_index;           // toggle flag. should buffer be indexed.
         boolean                is_html;            // shortcut flag for html parser
         boolean                bump_word;          // boolean for moving word position/adding space
         unsigned int           offset;             // current offset position
         swish_TagStack        *metastack;          // stacks for tracking the tag => metaname
         swish_TagStack        *propstack;          // stacks for tracking the tag => property
         swish_TagStack        *domstack;           // stacks for tracking xml/html dom tree
         xmlParserCtxtPtr       ctxt;               // so we can free at end
         swish_TokenIterator   *token_iterator;     // token container
         swish_NamedBuffer     *properties;         // buffer all properties
         swish_NamedBuffer     *metanames;          // buffer all metanames
     };
     
     
    Global Functions
     void            swish_setup();
     const char *    swish_lib_version();
     const char *    swish_libxml2_version();
     
    Top-Level Functions
     swish_3 *       swish_3_init( void (*handler) (swish_ParserData *), void *stash );
     void            swish_3_free( swish_3 *s3 );
     int             swish_parse_file( swish_3 * s3, xmlChar *filename);
     unsigned int    swish_parse_fh( swish_3 * s3, FILE * fh);
     int             swish_parse_buffer( swish_3 * s3, xmlChar * buf);
     unsigned int    swish_parse_directory( swish_3 *s3, xmlChar *dir, boolean follow_symlinks );
     
    I/O Functions
     xmlChar *   swish_io_slurp_fh( FILE * fh, unsigned long flen, boolean binmode );
     xmlChar *   swish_io_slurp_file_len( xmlChar *filename, off_t flen, boolean binmode );
     xmlChar *   swish_io_slurp_gzfile_len( xmlChar *filename, off_t flen, boolean binmode );
     xmlChar *   swish_io_slurp_file( xmlChar *filename, off_t flen, boolean is_gzipped, boolean binmode );
     long int    swish_io_count_operable_file_lines( xmlChar *filename );
     boolean     swish_io_is_skippable_line( xmlChar *str );
     
    Filesystem Functions
     boolean     swish_fs_file_exists( xmlChar *filename );
     boolean     swish_fs_is_dir( xmlChar *path );
     boolean     swish_fs_is_file( xmlChar *path );
     boolean     swish_fs_is_link( xmlChar *path );
     off_t       swish_fs_get_file_size( xmlChar *path );
     time_t      swish_fs_get_file_mtime( xmlChar *path );
     xmlChar *   swish_fs_get_file_ext( xmlChar *url );
     boolean     swish_fs_looks_like_gz( xmlChar *file );
     
    Hash Functions
     int         swish_hash_add( xmlHashTablePtr hash, xmlChar *key, void * value );
     int         swish_hash_replace( xmlHashTablePtr hash, xmlChar *key, void *value );
     int         swish_hash_delete( xmlHashTablePtr hash, xmlChar *key );
     boolean     swish_hash_exists( xmlHashTablePtr hash, xmlChar *key );
     int         swish_hash_exists_or_add( xmlHashTablePtr hash, xmlChar *key, xmlChar *value );
     void        swish_hash_merge( xmlHashTablePtr hash1, xmlHashTablePtr hash2 );
     void *      swish_hash_fetch( xmlHashTablePtr hash, xmlChar *key );
     void        swish_hash_dump( xmlHashTablePtr hash, const char *label );
     xmlHashTablePtr swish_hash_init(int size);
     void        swish_hash_free( xmlHashTablePtr hash );
     
    Memory Functions
     void        swish_mem_init();
     void *      swish_xrealloc(void *ptr, size_t size);
     void *      swish_xmalloc( size_t size );
     void        swish_xfree( void *ptr );
     void        swish_mem_debug();
     long int    swish_memcount_get();
     void        swish_memcount_dec();
     xmlChar *   swish_xstrdup( const xmlChar * ptr );
     xmlChar *   swish_xstrndup( const xmlChar * ptr, int len );
     
    Time Functions
     double      swish_time_elapsed(void);
     double      swish_time_cpu(void);
     char *      swish_time_print(double time);
     char *      swish_time_print_fine(double time);
     char *      swish_time_format(time_t epoch);
     
    Error Functions
     void        swish_set_error_handle( FILE *where );
     void        swish_croak(const char *file, int line, const char *func, const char *msg,...);
     void        swish_warn(const char *file, int line, const char *func, const char *msg,...);
     void        swish_debug(const char *file, int line, const char *func, const char *msg,...);
     
    String Functions
     char *              swish_get_locale();
     void                swish_verify_utf8_locale();
     boolean             swish_is_ascii( xmlChar *str );
     int                 swish_bytes_in_wchar( int wchar );
     int                 swish_utf8_chr_len( xmlChar *utf8 );
     uint32_t            swish_utf8_codepoint( xmlChar *utf8 );
     int                 swish_utf8_num_chrs( xmlChar *utf8 );
     void                swish_utf8_next_chr( xmlChar *s, int *i );
     void                swish_utf8_prev_chr( xmlChar *s, int *i );
     xmlChar *           swish_str_escape_utf8( xmlChar *utf8 );
     xmlChar *           swish_str_unescape_utf8( xmlChar *ascii );
     wchar_t *           swish_locale_to_wchar(xmlChar * str);
     xmlChar *           swish_wchar_to_locale(wchar_t * str);
     wchar_t *           swish_wstr_tolower(wchar_t *s);
     xmlChar *           swish_str_tolower(xmlChar *s );
     xmlChar *           swish_utf8_str_tolower(xmlChar *s);
     xmlChar *           swish_ascii_str_tolower(xmlChar *s);
     xmlChar *           swish_str_skip_ws(xmlChar *s);
     void                swish_str_trim_ws(xmlChar *string);
     void                swish_str_ctrl_to_ws(xmlChar *s);
     boolean             swish_str_all_ws(xmlChar * s);
     boolean             swish_str_all_ws_len(xmlChar * s, int len);
     void                swish_debug_wchars( const wchar_t * widechars );
     int                 swish_wchar_t_comp(const void *s1, const void *s2);
     int                 swish_sort_wchar(wchar_t *s);
     swish_StringList *  swish_stringlist_build(xmlChar *line);
     swish_StringList *  swish_stringlist_init();
     void                swish_stringlist_free(swish_StringList *sl);
     unsigned int        swish_stringlist_add_string(swish_StringList *sl, xmlChar *str);
     void                swish_stringlist_merge(swish_StringList *sl1, swish_StringList *sl2);
     swish_StringList *  swish_stringlist_copy(swish_StringList *sl);
     swish_StringList *  swish_stringlist_parse_sort_string(xmlChar *sort_string, swish_Config *cfg);
     void                swish_stringlist_debug(swish_StringList *sl);
     int                 swish_string_to_int( char *buf );
     boolean             swish_string_to_boolean( char *buf );
     xmlChar *           swish_int_to_string( int val );
     xmlChar *           swish_long_to_string( long val );
     xmlChar *           swish_double_to_string( double val );
     xmlChar *           swish_date_to_string( int y, int m, int d );
     char                swish_get_C_escaped_char(xmlChar *s, xmlChar **se);
     
    Configuration Functions
     swish_Config *      swish_config_init();
     void                swish_config_set_default( swish_Config *config );
     void                swish_config_merge( swish_Config *config1, swish_Config *config2 );
     swish_Config *      swish_config_add( swish_Config * config, xmlChar * conf );
     swish_Config *      swish_config_parse( swish_Config * config, xmlChar * conf );
     void                swish_config_debug( swish_Config * config );
     void                swish_config_free( swish_Config * config);
     xmlHashTablePtr     swish_mime_defaults();
     xmlChar *           swish_mime_get_type( swish_Config * config, xmlChar * fileext );
     xmlChar *           swish_mime_get_parser( swish_Config * config, xmlChar *mime );
     swish_ConfigFlags * swish_config_flags_init();
     void                swish_config_flags_free( swish_ConfigFlags *flags );
     void                swish_config_test_alias_fors( swish_Config *config );
     void                swish_config_test_unique_ids( swish_Config *config );
     
     
    Parser Functions
     swish_Parser *  swish_parser_init( void (*handler) (swish_ParserData *) );
     void            swish_parser_free( swish_Parser * parser );
     
    Token Functions
     swish_TokenList *   swish_token_list_init();
     void                swish_token_list_free( swish_TokenList *tl );
     int                 swish_token_list_add_token(    
                                             swish_TokenList *tl, 
                                             xmlChar *token,
                                             int token_len,
                                             swish_MetaName *meta,
                                             xmlChar *context );
     int                 swish_token_list_set_token(
                                             swish_TokenList *tl,
                                             xmlChar *token,
                                             int len );
     swish_Token *       swish_token_init();
     void                swish_token_free( swish_Token *t );
     swish_TokenIterator *swish_token_iterator_init( swish_Analyzer *a );
     void                swish_token_iterator_free( swish_TokenIterator *ti );
     swish_Token *       swish_token_iterator_next_token( swish_TokenIterator *it );
     int                 swish_tokenize(     swish_TokenIterator *ti, 
                                             xmlChar *buf, 
                                             swish_MetaName *meta,
                                             xmlChar *context );
     int                 swish_tokenize_ascii(    
                                             swish_TokenIterator *ti, 
                                             xmlChar *buf, 
                                             swish_MetaName *meta,
                                             xmlChar *context );
     int                 swish_tokenize_utf8(    
                                             swish_TokenIterator *ti, 
                                             xmlChar *buf, 
                                             swish_MetaName *meta,
                                             xmlChar *context );
     void                swish_token_list_debug( swish_TokenIterator *it );
     xmlChar *           swish_token_list_get_token_value( swish_TokenList *tl, swish_Token *t );
     void                swish_token_debug( swish_Token *t );
     
     
    Analyzer Functions
     swish_Analyzer *    swish_analyzer_init( swish_Config * config );
     void                swish_analyzer_free( swish_Analyzer * analyzer );
     
    DocInfo Functions
     swish_DocInfo *     swish_docinfo_init();
     void                swish_docinfo_free( swish_DocInfo * ptr );
     int                 swish_docinfo_check(swish_DocInfo * docinfo, swish_Config * config);
     int                 swish_docinfo_from_filesystem(  xmlChar *filename, 
                                                         swish_DocInfo * i, 
                                                         swish_ParserData *parser_data );
     void                swish_docinfo_debug( swish_DocInfo * docinfo );
     
    Buffer Functions
     swish_NamedBuffer * swish_nb_init( xmlHashTablePtr confhash );
     void                swish_nb_free( swish_NamedBuffer * nb );
     void                swish_nb_debug( swish_NamedBuffer * nb, xmlChar * label );
     void                swish_nb_add_buf( swish_NamedBuffer *nb, 
                                           xmlChar * name,
                                           xmlBufferPtr buf, 
                                           xmlChar * joiner,
                                           boolean cleanwsp,
                                           boolean autovivify);
     void                swish_nb_add_str(   swish_NamedBuffer * nb, 
                                             xmlChar * name, 
                                             xmlChar * str,
                                             unsigned int len,
                                             xmlChar * joiner,
                                             boolean cleanwsp,
                                             boolean autovivify);
     void                swish_buffer_append( xmlBufferPtr buf, xmlChar * txt, int len );
     xmlChar*            swish_nb_get_value( swish_NamedBuffer* nb, xmlChar* key );
     
    Property Functions
     swish_Property *    swish_property_init( xmlChar *propname );
     void                swish_property_free( swish_Property *prop );
     void                swish_property_debug( swish_Property *prop );
     int                 swish_property_get_id( xmlChar *propname, xmlHashTablePtr properties );
     
    MetaName Functions
     swish_MetaName *    swish_metaname_init( xmlChar *name);
     void                swish_metaname_free( swish_MetaName *m );
     void                swish_metaname_debug( swish_MetaName *m );
     
    Header Functions
     boolean             swish_header_validate(char *filename);
     boolean             swish_header_merge(char *filename, swish_Config *c);
     swish_Config *      swish_header_read(char *filename);
     void                swish_header_write(char* filename, swish_Config* config);
     
    DESCRIPTION

    libswish3 is the core C library of Swish3.

    libswish3 uses the GNOME Libxml2 library to parse words and metadata from XML, HTML and plain text files. libswish3 supports full UTF-8 encoding.

    libswish3 is a parsing tool for use with information retrieval (IR) libraries. Dynamic language bindings are available in the source distribution in the bindings directory.

    APIs

    The following APIs are defined:

    Parsing API

    libswish3 provides three basic input functions:

    • swish_parse_file()

    • swish_parse_fh()

    • swish_parse_buffer()

    Each of these functions takes a swish_3 struct pointer as its first argument, as shown in the SYNOPSIS.

    In addition:

    • The swish_parse_file() function takes a file path, which must be a valid file. Directories and links are not supported. The assumption is that you will use your calling code to recurse through directories and handle links.

    • swish_parse_buffer() takes a string representing the document headers and the full text of the document.

    • swish_parse_fh() takes a filehandle pointer, which if set to NULL, defaults to stdin.

    See the Headers API section for more information on using swish_parse_fh() and swish_parse_buffer().

    See the handler Function section for more information on how to deal with the data extracted by each of the swish_parse_* functions.

    Headers API

    The Headers API supports and extends the Swish-e -S prog feature, which allows you to feed the indexer with output from another program. The API has been extended from Swish-e's to allow for MIME types and more congruence with the HTTP 1.1 specification.

    See SWISH-RUN documentation in the Swish-e distribution for the Swish-e version 2 headers API.

    This is the libswish3 implementation. See SWISH::Prog::Headers for a simple Perl-based way of generating the proper headers.

    • Content-Location

      Swish-e name: Path-Name

      The name of the document. May be any string: an ID of a record in a database, a URL or a simple file name. The string is stored in the swish_DocInfo uri struct member, which is often used as the primary identifier of a document in an index.

      This header is required.

    • Content-Length

      The length in bytes of the document, starting after the blank line separating the headers from the document itself. The value must be exactly the length of the document, including any extra line feeds or carriage returns at the end of the document.

      Example:

       Content-Location: foo.html
       Content-Length: 9

       The doc.\n
       ^^^^^^^^ ^
       12345678 9

      The value is stored in the swish_DocInfo size struct member.

      This header is required.

    • Last-Modified

      Swish-e name: Last-Mtime

      The last modification time of the document. The value must be an integer: the seconds since the Epoch on your system.

      If not present, the value defaults to the current time.

      The value is stored in the swish_DocInfo mtime struct member.

      This header is not required.

    • Parser-Type

      Swish-e name: Document-Type

      Explicitly name the parser used for the document, rather than defaulting to the MIME type mapping based on Content-Type and/or Content-Location. The three parser types are:

      • XML

      • HTML

      • TXT

      The Swish-e values XML2, XML*, HTML2, HTML*, TXT2, TXT* are also supported for compatibility, but they map to the three libswish3 types.

      The value is stored in the swish_DocInfo parser struct member.

      If not present, the document parser will be automatically chosen based on the following logic:

      • If a Content-Type is given, the parser mapped to that MIME type will be used. You may override the default mappings in your configuration. See Configuration API.

      • If no Content-Type is given, a MIME type will be guessed at based on the file extension of the document's Content-Location, and the parser mapped to that MIME type will be used.

      • Finally, if a MIME type is not identified, the parser defined in SWISH_DEFAULT_PARSER in libswish3.h will be used.

      See also Content-Type and Content-Location.

      This header is not required.

    • Content-Type

      The MIME type of the document. The libswish3 MIME type list is based on the Apache 2.0 version. See http://www.iana.org/assignments/media-types/ for the official registry.

      If not defined with Content-Type, the MIME type will be guessed based on the file extension in the Content-Location header. If the Content-Location string does not contain a file extension (as might be the case with a non-URL value), or the file extension has no MIME mapping, then the MIME type will default to SWISH_DEFAULT_MIME as defined in libswish3.h.

      You may override the default extension-to-MIME mappings in your configuration. See Configuration API.

      The value is stored in the swish_DocInfo mime struct member.

      See also Content-Location and Parser-Type.

      This header is not required.

    • Update-Mode

      NOTE: This header exists only for backwards compatibility with Swish-e's incremental index feature. It may be deprecated in a future version of libswish3.
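    To make the header rules above concrete, here is a minimal sketch of building a header block followed by the document body, in the shape described for swish_parse_buffer(). The helper name build_doc_buffer is hypothetical, and the use of a bare "\n" as the line terminator is an assumption; check the libswish3 source or SWISH::Prog::Headers for the exact format the parser accepts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libswish3): write the two required
 * headers, an optional Content-Type, a blank line, and then the body
 * into out. Returns the number of bytes written (excluding the NUL
 * terminator), or -1 if out is too small.
 */
int build_doc_buffer(const char *uri, const char *mime, const char *body,
                     char *out, size_t outsz)
{
    assert(uri != NULL && body != NULL && out != NULL);

    /* Content-Length counts every byte of the body, including any
       trailing newline characters. */
    size_t body_len = strlen(body);
    int n;

    if (mime != NULL) {
        n = snprintf(out, outsz,
                     "Content-Location: %s\n"
                     "Content-Length: %zu\n"
                     "Content-Type: %s\n"
                     "\n%s",
                     uri, body_len, mime, body);
    } else {
        n = snprintf(out, outsz,
                     "Content-Location: %s\n"
                     "Content-Length: %zu\n"
                     "\n%s",
                     uri, body_len, body);
    }

    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n;
}
```

    With uri "foo.html", no MIME type, and the body "The doc.\n", the helper reproduces the Content-Length example above: a length of 9 that includes the final newline.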

    Structures API

    Writing an effective handler function requires an understanding of some of the key libswish3 data structures.

    For more details on any of these structures, see the SYNOPSIS.

    swish_3

    The main data structure. A swish_3 object has a swish_Config, swish_Analyzer and swish_Parser object and delegates to each as appropriate.

    This is typically the only object you need to create and use.

    swish_Config

    A configuration object. This object is required for initializing both a swish_Analyzer object and a swish_Parser object.

    swish_Parser

    A parser object. Required for executing any of the three swish_parse_* functions.

    swish_ParserData

    A parser data object. This object is passed around internally by the libxml2 SAX2 handlers, and is eventually the object passed to the handler function pointer. See The handler Function.

    swish_TokenList

    A list of words or tokens. The object is typically accessed via a swish_TokenIterator, like this:

        // example of swish_TokenIterator
        // token_iterator is a swish_TokenIterator pointer
        swish_Token *t;
        while ((t = swish_token_iterator_next_token(token_iterator)) != NULL) {
            SWISH_DEBUG_MSG("\n\
            t->ref_cnt      = %d\n\
            t->pos          = %u\n\
            t->context      = %s\n\
            t->meta         = %d [%s]\n\
            t->len          = %u\n\
            t->value        = %s\n\
        ", t->ref_cnt, t->pos, t->context, t->meta->id, t->meta->name, t->len, t->value);
        }

    See the swish_token_list_debug() function from which the code above is taken.

    swish_Token

    An object representing one word or token. The word's position relative to other words, length, tag context and MetaName are all available in the object.

    swish_DocInfo

    An object describing metadata about the document itself: URI, MIME type, size, etc.

    swish_Analyzer

    The Analyzer object controls how the character content of a document is parsed: whether or not a TokenList is created with a tokenizer, if the words (tokens) are lowercased or stemmed, etc.
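    As a rough illustration of what those settings control, the sketch below splits ASCII text into tokens, optionally lowercases them, and drops tokens outside a length window. It is not the libswish3 tokenizer: demo_analyzer is a stand-in with fields named after the swish_Analyzer members in the SYNOPSIS, and the rule "a token is a run of ASCII alphanumerics" is an assumption made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for analyzer settings; NOT the real swish_Analyzer. */
typedef struct {
    unsigned int minwordlen;   /* shortest token to keep */
    unsigned int maxwordlen;   /* longest token to keep */
    int          lc;           /* non-zero: lowercase tokens */
} demo_analyzer;

/* Called once per kept token; tok is NOT NUL-terminated, use len. */
typedef void (*demo_token_cb)(const char *tok, size_t len, void *user);

/*
 * Split buf into runs of ASCII alphanumerics. Lowercasing is done in
 * place, so buf must be writable. Returns the number of tokens kept.
 */
unsigned int demo_tokenize_ascii(const demo_analyzer *a, char *buf,
                                 demo_token_cb cb, void *user)
{
    assert(a != NULL && buf != NULL && cb != NULL);
    unsigned int count = 0;
    size_t i = 0;

    while (buf[i] != '\0') {
        /* Skip separators. */
        while (buf[i] != '\0' && !isalnum((unsigned char)buf[i]))
            i++;
        size_t start = i;
        /* Consume one token. */
        while (buf[i] != '\0' && isalnum((unsigned char)buf[i])) {
            if (a->lc)
                buf[i] = (char)tolower((unsigned char)buf[i]);
            i++;
        }
        size_t len = i - start;
        if (len > 0 && len >= a->minwordlen && len <= a->maxwordlen) {
            cb(buf + start, len, user);
            count++;
        }
    }
    return count;
}

/*
 * Example callback: append each token plus '|' to a NUL-terminated
 * string passed as user. The caller must provide enough room.
 */
void demo_join_cb(const char *tok, size_t len, void *user)
{
    char *out = user;
    size_t used = strlen(out);
    memcpy(out + used, tok, len);
    out[used + len] = '|';
    out[used + len + 1] = '\0';
}
```

    In libswish3 itself the tokenizer and stemmer are function pointers on the swish_Analyzer struct, so bindings can supply their own implementations.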

    The handler Function

    The handler function pointer is the final link in the parsing chain. The function pointer is set in the swish_Parser object constructor, and is called by each of the swish_parse_* functions after the entire document has been parsed and (optionally) tokenized.

    The handler receives one argument: a swish_ParserData object containing all the metadata and words in the document.

    If all you wanted to do was print out a report about each document as it was parsed, your handler function might be as simple as:

     void
     my_handler( swish_ParserData * parse_data )
     {
        swish_docinfo_debug( parse_data->docinfo );
        swish_token_list_debug( parse_data->token_iterator );
        swish_nb_debug( parse_data->properties, "Property" );
        swish_nb_debug( parse_data->metanames, "MetaName" );
     }

    IMPORTANT: After the handler function is called, all the structures referenced by the swish_ParserData object are automatically freed, so if you intend to keep any of the data for storing in an index, you will need to strdup() words, properties, docinfo, etc. as part of your indexing code.
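
    The copy-before-free pattern described above can be sketched as follows. This is a minimal illustration, not libswish3 code: the demo_Token struct and the demo_* helpers are hypothetical stand-ins, and a real handler would read its values from the swish_Token fields shown earlier. The demo_strdup() helper is assumed here only as a portable equivalent of strdup().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the token fields an indexer might keep.
   This is NOT the real swish_Token definition. */
typedef struct {
    char *value;    /* owned copy of the token text */
    char *context;  /* owned copy of the tag context */
    int   pos;      /* word position within the document */
} demo_Token;

/* Portable equivalent of strdup(): returns a malloc'd copy, or NULL. */
char *
demo_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Release a token created by demo_copy_token(). NULL is accepted. */
void
demo_free_token(demo_Token *t)
{
    if (t == NULL)
        return;
    free(t->value);
    free(t->context);
    free(t);
}

/* Deep-copy one token so it outlives the parser's own buffers.
   Returns NULL on allocation failure. */
demo_Token *
demo_copy_token(const char *value, const char *context, int pos)
{
    demo_Token *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->value = demo_strdup(value);
    t->context = demo_strdup(context);
    t->pos = pos;
    if (t->value == NULL || t->context == NULL) {
        demo_free_token(t);
        return NULL;
    }
    return t;
}
```

    Inside a handler you would call something like demo_copy_token() once per token while iterating, and keep only the copies after the handler returns.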

    See the example swish_lint.c file for how to create and pass in a handler function pointer to the swish_3_init() constructor.

    Configuration API

    Configuration is different with libswish3 than with Swish-e. The biggest change is that libswish3 configuration files are written in XML. This is done for several reasons:

    • 1

      Since libswish3 already has a powerful XML parser built-in, it's much easier to parse a configuration file written in XML than to port the Swish-e config parser to libswish3.

    • 2

      libswish3 stores index header information in an XML format nearly identical to the configuration file format. So the parser needs to understand only one XML schema.

    • 3

      You can store UTF-8 text in your configuration file and it will be parsed correctly.

    • 4

      The configuration directive list is extensible. Simple key/value configuration directives can be added without any modification to the libswish3 config parser. They are simply stored in the swish_Config struct hash for your own use and amusement.

    CAUTION: The configuration directive names documented in the Directives section below are reserved for use by libswish3. Some of them have special handling considerations (like MetaNames and PropertyNames). So the important idea to grasp with the extensible configuration feature is "simple key/value pairs."

    This section describes how to build a libswish3 configuration file.

    Configuration Example

    Here's an example libswish3 configuration file:

     <swish>
      <FollowSymLinks>yes</FollowSymLinks>
      
      <MetaNames>
       <foo bias="+10" />
       <bar bias="-5" />
       <swishtitle bias="+50" />
       <title alias_for="swishtitle" />
       <other>color size weight</other>
      </MetaNames>
      
      <PropertyNames>
       <foo type="text" ignore_case="1" />
       <bar type="int" />
       <lastmod type="date" />
       <bing ignore_case="0" />
       <description verbatim="1" max="10000" alias="body" sort_length="20" />
       <notsorted sort="0" />
      </PropertyNames>
      
      <Tokenize>1</Tokenize>
     </swish>

    And here's that same example, dissected:

     <swish>

    The top level tag.

     <FollowSymLinks>yes</FollowSymLinks>

    Equivalent to the Swish-e style:

     FollowSymLinks yes

    which simply informs whatever aggregator you are using that when confronted with a symlink on the filesystem, it should be followed.

    FollowSymLinks is an example of a simple key/value pair (see the CAUTION above).

    MetaNames

    Here's the first big difference from Swish-e. MetaNames, MetaNameAlias, and MetaNamesRank have been combined into a single XML tag with appropriate attributes.

     <foo bias="+10" />

    is the same thing as (in Swish-e style):

     MetaNames foo
     MetaNamesRank 10 foo

    while:

     <swishtitle bias="+50" />
     <title alias_for="swishtitle" />

    is equivalent to:

     MetaNames swishtitle
     MetaNameAlias swishtitle title
     MetaNamesRank 50 swishtitle

    You can see that the XML style allows for a terser, more compact expression. You can still assign multiple aliases to a single MetaName:

     <other>color size weight</other>

    is equivalent to:

     MetaNames other
     MetaNameAlias other color size weight
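
    To make the alias relationship concrete, here is a minimal sketch of how a tag name might be resolved to its canonical MetaName and bias. The demo_MetaName table and the demo_* functions are hypothetical and only mirror the example configuration; libswish3's own internal storage is not shown here and may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of the <MetaNames> example above. */
typedef struct {
    const char *name;       /* tag name as it appears in a document */
    const char *alias_for;  /* canonical MetaName, or NULL if canonical */
    int         bias;       /* ranking bias, set on canonical names only */
} demo_MetaName;

static const demo_MetaName demo_metanames[] = {
    { "foo",        NULL,         10 },
    { "bar",        NULL,         -5 },
    { "swishtitle", NULL,         50 },
    { "title",      "swishtitle",  0 },
    { "other",      NULL,          0 },
    { "color",      "other",       0 },
    { "size",       "other",       0 },
    { "weight",     "other",       0 },
};

/* Find the table entry with exactly this name, or NULL. */
const demo_MetaName *
demo_find(const char *name)
{
    size_t i;
    size_t n = sizeof demo_metanames / sizeof demo_metanames[0];
    for (i = 0; i < n; i++) {
        if (strcmp(demo_metanames[i].name, name) == 0)
            return &demo_metanames[i];
    }
    return NULL;
}

/* Resolve a tag name to its canonical MetaName entry, following one
   level of aliasing. Returns NULL for an unknown tag. */
const demo_MetaName *
demo_resolve(const char *tag)
{
    const demo_MetaName *m = demo_find(tag);
    if (m != NULL && m->alias_for != NULL)
        m = demo_find(m->alias_for);
    return m;
}
```

    With this table, resolving title yields the swishtitle entry and its bias of 50, and resolving color, size or weight yields other.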

    In addition, there are some special features intended for use with HTML documents.

     <links html="1" alias_for="href" />      # same as HTMLLinksMetaName
     <images html="1" alias_for="src" />      # same as ImageLinksMetaName
     <alttext html="1" alias_for="alt" />     # same as IndexAltTagMetaName

    PropertyNames

    PropertyNames, PropertyNamesCompareCase, PropertyNamesIgnoreCase, PropertyNamesNoStripChars, PropertyNamesNumeric, PropertyNamesDate, PropertyNameAlias, PropertyNamesMaxLength, PropertyNamesSortKeyLength, StoreDescription and PreSortedIndex have all been combined into a single XML directive.

    Here's the example from above with equivalent Swish-e directives annotated:

     <foo ignore_case="1" />
     # PropertyNamesIgnoreCase foo
     <bar type="int" />
     # PropertyNamesNumeric bar
     
     <lastmod type="date" />
     # PropertyNamesDate lastmod
     
     <bing ignore_case="0" />
     # PropertyNamesCompareCase bing
     
     <description verbatim="1" max="10000" alias="body" sort_length="20" />
     # PropertyNamesNoStripChars description
     # PropertyNamesMaxLength 10000 description
     # PropertyNameAlias description body
     # PropertyNamesSortKeyLength 20 description
     <notsorted sort="0" />
     # PreSortedIndex foo bar lastmod bing description

    Again, the XML format greatly simplifies the syntax. You can assign attributes as you need, though be aware that some attributes are inherently mismatched and might generate an error or unexpected behaviour:

     <foo ignore_case="1" type="int" />     # wrong
     <foo ignore_case="0" type="date" />    # wrong
     <foo verbatim="1" type="int" />        # wrong
     <foo sort="0" max="20" />              # wrong

    Directives

    The following configuration directives are currently reserved by libswish3. These are top-level tags within the <swish> root tag. If you use top-level tags other than these, they will be treated as key/value pairs and added to the misc hash in the swish_Config struct.

    • swish_version

      Contains the value of the SWISH_VERSION constant.

    • swish_lib_version

      Contains the value of the SWISH_LIB_VERSION constant.

    • MetaNames

      Contains MetaName definitions.

    • PropertyNames

      Contains PropertyName definitions.

    • Parsers

      Contains a mapping of Parser name to MIME types. Example:

       <Parsers>
        <XML>application/xml</XML>
        <HTML>text/html</HTML>
        <TXT>text/plain</TXT>
        <XML>text/foo</XML>
        <HTML>default</HTML>
        <HTML>foo/bar</HTML>
       </Parsers>

    • MIME

      Contains a mapping of file extensions to MIME types. Example:

       <MIME>
        <au>foo/bar</au>
       </MIME>

    • Index

      Contains attributes of the inverted index. Example (with defaults):

       <Index>
        <Format>Native</Format>
        <Name>index.swish</Name>
        <Locale>en_US.UTF-8</Locale>
       </Index>

    • TagAlias

      Contains mapping of alias names to tag names.

       <TagAlias>
        <swishdescription>body</swishdescription>
        <swishtitle>title</swishtitle>
        <swishtitle>foo</swishtitle>
        <swishtitle>bar</swishtitle>
       </TagAlias>

    • Tokenize

      Toggle the tokenizer on or off. Default is yes (on).

       <Tokenize>yes</Tokenize>

    • CascadeMetaContext

      Toggle the cascading effect of MetaName context. The default is no (off).

       <CascadeMetaContext>no</CascadeMetaContext>

      The effect is to consider a Token assigned to every MetaName in its DOM stack. An example:

       <doc>
        <metaone>
         foo
         <metatwo>bar</metatwo>
        </metaone>
       </doc>

      If CascadeMetaContext is true (on), then the token bar will be tracked under both MetaNames, metaone and metatwo. If CascadeMetaContext is false (off), then the token bar will be tracked only under metatwo.
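
    The cascading rule can be sketched in a few lines of C. The demo_token_metanames() function below is a hypothetical helper written for this illustration, not a libswish3 function; it assumes the open MetaName tags around a token are available as an array ordered from outermost to innermost.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libswish3. Given the stack of
   MetaName tags open around a token (outermost first), write into out[]
   the MetaNames the token is tracked under and return how many were
   written. out[] must have room for at least depth entries. */
size_t
demo_token_metanames(const char *const *stack, size_t depth,
                     int cascade, const char **out)
{
    size_t n = 0;
    size_t i;

    if (depth == 0)
        return 0;

    if (cascade) {
        /* cascade on: every MetaName in the stack applies */
        for (i = 0; i < depth; i++)
            out[n++] = stack[i];
    } else {
        /* cascade off: only the innermost MetaName applies */
        out[n++] = stack[depth - 1];
    }
    return n;
}
```

    For the token bar in the example document the stack is metaone, metatwo. With cascade on the helper reports both names; with cascade off it reports only metatwo. The token foo has a stack of just metaone, so it is reported under metaone either way.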

    EXAMPLES

    See the swish_lint.c file included in the libswish3 distribution.

    FAQ

    What is IR?

    Information Retrieval.

    How is libswish3 related to Swish-e?

    libswish3 is the core parsing library for Swish-e version 3 (Swish3).

    Is libswish3 a search engine?

    No. libswish3 is a document parser. It might work well in or with any number of search engines, but it is not in itself any kind of search tool.

    So what does libswish3 DO exactly?

    libswish3 reads text, HTML and XML files and extracts just the words and document properties from each document. It then hands off the token list and properties to a handler function. Finally, it frees all the memory associated with the token list and properties.

    The handler function can do whatever you wish, though typically a handler would iterate over the words in the token list and add each one to an index using an IR library API.
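
    As a rough sketch of that last step, the toy index below counts how often each term is added. Everything here is hypothetical (demo_Index and the demo_index_* functions are invented for this illustration); a real handler would call the API of whichever IR library you have chosen instead.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_TERMS 64
#define DEMO_MAX_LEN   32

/* Hypothetical toy index: term -> number of occurrences.
   Zero-initialize it before first use. */
typedef struct {
    char   term[DEMO_MAX_TERMS][DEMO_MAX_LEN];
    int    count[DEMO_MAX_TERMS];
    size_t n_terms;
} demo_Index;

/* Add one token. The text is copied into the index, so the caller's
   buffer may be freed afterwards. Returns 0 on success, or -1 if the
   term is too long or the index is full. */
int
demo_index_add(demo_Index *idx, const char *term)
{
    size_t i;
    size_t len = strlen(term);

    if (len >= DEMO_MAX_LEN)
        return -1;

    for (i = 0; i < idx->n_terms; i++) {
        if (strcmp(idx->term[i], term) == 0) {
            idx->count[i]++;
            return 0;
        }
    }

    if (idx->n_terms == DEMO_MAX_TERMS)
        return -1;

    memcpy(idx->term[idx->n_terms], term, len + 1);
    idx->count[idx->n_terms] = 1;
    idx->n_terms++;
    return 0;
}

/* Return how many times a term was added, or 0 if it is unknown. */
int
demo_index_count(const demo_Index *idx, const char *term)
{
    size_t i;
    for (i = 0; i < idx->n_terms; i++) {
        if (strcmp(idx->term[i], term) == 0)
            return idx->count[i];
    }
    return 0;
}
```

    Because the index stores its own copy of each term, it stays valid after libswish3 frees the token list.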

    BACKGROUND

    libswish3 is part of the Swish-e project. It was born out of the need for UTF-8 and incremental indexing support and a desire to experiment with alternate indexing libraries like Lucene, KinoSearch, Xapian and Hyperestraier.

    libswish3 was developed with the idea that many quality IR libraries already exist, but few if any provide an easy and fast way of preparing documents for indexing. The following assumptions informed the development of libswish3.

    The IR Toolchain

    A decent IR toolchain requires 5 parts:

    • aggregator

      Collects documents from a filesystem, database, website or other sources.

    • filter

      Normalizes documents to a standard format (plain text or a delimited/markup like YAML, HTML or XML) for indexing.

    • parser

      Breaks a document into a list of words, including their context and position.

    • indexer

      Writes the list of words in a storage system for quick, efficient retrieval.

    • searcher

      Parses queries and fetches data from the indexer's storage system.

    Of course, the division between these parts is not always clean or apparent. Parsing search queries, for example, will necessarily involve elements of the parser and searcher components, while the indexer and searcher are of necessity intrinsically bound.

    But any complete IR system will have these five parts in some combination.

    Swish-e aggregators and filters are already good

    The existing Swish-e document aggregators (DirTree.pl and spider.pl) and filtering system (SWISH::Filter) are good. They are all written in Perl and are easily modified, and they have ample configuration options and documentation.

    Why reinvent the wheel?

    Several good IR libraries exist that provide an indexer and searcher. These libraries do UTF-8, incremental indexing, and have search syntax on par with (or better than) Swish-e 2.x. Examples include Xapian, KinoSearch and Lucene. While they might be a little slower than Swish-e (at least in terms of indexing speed) they make up for that with:

    • well-documented APIs

    • bindings in a variety of programming languages

    • active development communities

    • the flexibility that comes with being a library instead of a fixed program

    The missing link

    The piece that Swish-e provides that other IR libraries lack is a fast, stable, integrated document parser. Xapian has Omega, but it does not parse XML, nor does it recognize ad hoc word context (metanames).

    However, the Swish-e 2.x parser does not work independently of the Swish-e indexer and searcher, nor does it support UTF-8.

    One piece is missing: a parser that works with the Swish-e aggregator/filter system, supports UTF-8, and offers flexible options for connecting with other IR libraries.

    Ergo, libswish3: a document parser compatible with the existing Swish-e -S prog API and capable of generating UTF-8 token lists for indexing with a variety of IR libraries.

    Where does libswish3 fit?

    libswish3 is the core C library in Swish3.

    However, libswish3 may be used without the rest of Swish3. The assumption is that libswish3 could fit into an IR toolchain like this:

     aggregator -> filter -> libswish3 -> some IR library

    You could then use the native search API of the IR library.

    For example, you might use the Swish-e spider.pl script to spider a website, filtering documents with SWISH::Filter and then handing the output to a libswish3-based program that will parse the documents into words and store the data in a Xapian or KinoSearch index (or both!). That model is, in fact, what Swish3 does.

    Or you might use the SWISH::Prog Perl module (from the CPAN) to build your own aggregator/filter system, then hand the output to libswish3.

    AUTHOR

    Peter Karman (peter@peknet.com).

    CREDITS

    libswish3 is inspired by code from Swish-e (http://www.swish-e.org), Libxml2 (http://www.xmlsoft.org), Apache (http://www.apache.org), Rahul Dhesi (http://www.tug.org/tex-archive/tools/zoo/), Angel Ortega (http://www.triptico.com/software/unicode.html), James Henstridge (http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html), YoLinux (http://www.yolinux.com/TUTORIALS/GnomeLibXml2.html) and no doubt many unnamed others.

    All mistakes, errors and poor programming choices are, however, those of the author.

    LICENSE

    libswish3 is licensed under the GPL.

    libswish3 is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    libswish3 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.

    You should have received a copy of the GNU Library General Public License along with libswish3; see the file COPYING. If not, write to the

     Free Software Foundation, Inc.
     59 Temple Place - Suite 330
     Boston, MA 02111-1307, USA

    SEE ALSO

    The project homepage: http://dev.swish-e.org/wiki/swish3

    swish_lint(1), swish_isw(1), swish_words(1)

    SWISH::Prog::KSx and SWISH::Prog::Xapian on CPAN

    Uploaded first pass at both implementations this last week. The announcement to the Swish-e list just went out.

    SWISH::3 on CPAN

    After 4 years of learning how to glue Perl and C together with XS and many sleepless nights, I have released SWISH::3 to the CPAN.

    <cue the sound of scattered applause>

    Mostly this is a triumph of longevity rather than quality code. It's taken me this long to get something workable.

    swish_xapian

    The Xapian backend for Swish3 has been getting some love lately. The swish_xapian command line tool now has most of the features that swish-e v2.x does.

    I've posted about it on the Swish-e wiki.