Projects

In this area you'll find other things I'm working on and/or interested in.

But have I mentioned my beautiful son lately?

make test

Invoking make test in a project and watching thousands of successful tests scroll by, culminating in the "All tests successful." message, gives me the same thrill of satisfaction I felt when I used to paint houses. After a long day of sweaty labor sanding and chipping off old paint, I could stand back and survey the structure, primed and ready for a fresh coat. It's the anticipation that thrills, in the same way that a trip to the grocery store and a full fridge, or several loads of clean laundry folded and stowed safely away in drawers, thrills me. The knowing that I am prepared, belt cinched tight, all tests successful.

Frozen Perl 2010

It's been a long week, culminating today in Frozen Perl 2010, a Perl conference for and by Perl hackers, here in the Twin Cities. I gave two talks at today's conference, one on Swish3 and the other on Devel::NYTProf and Search::Tools. Both talks seemed well-received.

In the process of preparing the talks I also released a few new, related modules to CPAN this week:

Search::OpenSearch
OpenSearch server glue for KinoSearch and Swish-e 2.x via SWISH::Prog. There's a demo Plack app with ExtJS that uses both search engines, included as part of the slides for my Swish3 talk.

I think OpenSearch is very cool and look forward to doing more with that spec, including adding more features (e.g. facets) to Search::OpenSearch.

Search::Query
Search::Query now has support for SQL and SWISH Dialects. I hope to add KinoSearch and Xapian dialects soon. The Search::Query::Parser now has (undocumented and experimental) support for range queries, so that you can say:
foo=( 1..4 )
and that'll be expanded to
foo=( 1 OR 2 OR 3 OR 4 )
when the Dialect query object is stringified. Handy for things like ranges of dates, which is how I am using it at $work.
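The expansion itself is simple enough to sketch. The following is a minimal illustration in C, not the Perl code in Search::Query::Parser; expand_range() and its signature are hypothetical names chosen for this example, and it handles only integer ranges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand field=( lo..hi ) into field=( lo OR lo+1 OR ... OR hi ).
 * Returns the number of characters written (excluding the NUL), or -1 if
 * the field is empty, the range is empty, or the output buffer is too small.
 * expand_range() is a hypothetical helper, not part of Search::Query.
 */
int expand_range(const char *field, int lo, int hi, char *out, size_t outlen)
{
    assert(field != NULL && out != NULL);
    if (strlen(field) == 0 || lo > hi || outlen == 0)
        return -1;

    int n = snprintf(out, outlen, "%s=( ", field);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    size_t used = (size_t)n;

    for (int i = lo; i <= hi; i++) {
        /* every value after the first is joined with " OR " */
        n = snprintf(out + used, outlen - used, "%s%d",
                     (i == lo) ? "" : " OR ", i);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, outlen - used, " )");
    if (n < 0 || (size_t)n >= outlen - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

Calling expand_range("foo", 1, 4, buf, sizeof buf) fills buf with the expanded form shown above. A wide range produces a long OR list, so a real implementation would probably want an upper bound on the range size.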

Search::Tools, SWISH::API::*
New releases of these older modules as well, with some bug fixes and refactoring to support Search::Query.

So, yes. A busy week.

I enjoyed hearing other folks' talks today at Frozen Perl. There was a good variety: pack/unpack, Unicode, i18n and best practice-related presentations. I met some new people, renewed friendships with folks I already knew, and drank lots of free coffee. The cookies were good too.

libswish3.3
NAME

libswish3 - Swish3 C library

SYNOPSIS
Data Structures
 
 struct swish_3
 {
     int             ref_cnt;
     void           *stash;
     swish_Config   *config;
     swish_Analyzer *analyzer;
     swish_Parser   *parser;
 };
 
 struct swish_StringList
 {
     unsigned int    n;
     unsigned int    max;
     xmlChar**       word;
 };
 
 
 struct swish_Config
 {
     int                          ref_cnt;
     void                        *stash;      /* for bindings */
     xmlHashTablePtr              misc;
     xmlHashTablePtr              properties;
     xmlHashTablePtr              metanames;
     xmlHashTablePtr              tag_aliases;
     xmlHashTablePtr              parsers;
     xmlHashTablePtr              mimes;
     xmlHashTablePtr              index;
     xmlHashTablePtr              stringlists;
     struct swish_ConfigFlags    *flags;      /* shortcuts for parsing */
 };
 
 struct swish_ConfigFlags
 {
     boolean         tokenize;
     boolean         cascade_meta_context;
     xmlHashTablePtr meta_ids;
     xmlHashTablePtr prop_ids;
     //xmlHashTablePtr contexts;
 };
 
 struct swish_NamedBuffer
 {
     int             ref_cnt;    /* for bindings */
     void           *stash;      /* for bindings */
     xmlHashTablePtr hash;       /* the meat */
 };
 
 struct swish_DocInfo
 {
     time_t              mtime;
     off_t               size;
     xmlChar *           mime;
     xmlChar *           encoding;
     xmlChar *           uri;
     unsigned int        nwords;
     xmlChar *           ext;
     xmlChar *           parser;
     xmlChar *           action;
     boolean             is_gzipped;
     int                 ref_cnt;
 };
 
 struct swish_MetaName
 {
     int                 ref_cnt;
     int                 id;
     xmlChar            *name;
     int                 bias;
     xmlChar            *alias_for;
 };
 
 struct swish_Property
 {
     int                 ref_cnt;
     int                 id;
     xmlChar            *name;
     boolean             ignore_case;
     int                 type;
     boolean             verbatim;
     xmlChar            *alias_for;
     unsigned int        max;
     boolean             sort;
     boolean             presort;
     unsigned int        sort_length;
 };
 
 struct swish_Token
 {
     unsigned int        pos;            // this token's position in document
     swish_MetaName     *meta;
     xmlChar            *value;
     xmlChar            *context;
     unsigned int        offset;
     unsigned int        len;
     int                 ref_cnt;
 };
 
 struct swish_TokenList
 {
     unsigned int        n;
     unsigned int        pos;            // track position in document
     xmlHashTablePtr     contexts;       // cache contexts
     xmlBufferPtr        buf;
     swish_Token**       tokens;
     int                 ref_cnt;
 };
 
 struct swish_TokenIterator
 {
     swish_TokenList     *tl;
     swish_Analyzer      *a;
     unsigned int         pos;           // position in iteration
     int                  ref_cnt;
 };
 
 struct swish_Tag
 {
     xmlChar            *raw;            // tag as libxml2 sees it
     xmlChar            *baked;          // tag as libswish3 sees it
     xmlChar            *context;
     struct swish_Tag   *next;
     unsigned int        n;
 };
 
 struct swish_TagStack
 {
     swish_Tag         *head;
     swish_Tag         *temp;
     unsigned int       count;
     char              *name;       // debugging aid -- name of the stack
 };
 
 struct swish_Analyzer
 {
     unsigned int           maxwordlen;         // max word length
     unsigned int           minwordlen;         // min word length
     boolean                tokenize;           // should we parse into TokenList
     int                  (*tokenizer) (swish_TokenIterator*, xmlChar*, swish_MetaName*, xmlChar*);
     xmlChar*             (*stemmer)   (xmlChar*);
     boolean                lc;                 // should tokens be lowercased
     void                  *stash;              // for script bindings
     void                  *regex;              // optional regex
     int                    ref_cnt;            // for script bindings
 };
 
 struct swish_Parser
 {
     int                    ref_cnt;             // for script bindings
     void                 (*handler)(swish_ParserData*); // handler reference
     void                  *stash;               // for script bindings
     int                    verbosity;           
 };
 
 struct swish_ParserData
 {
     swish_3               *s3;                 // main object
     xmlBufferPtr           meta_buf;           // tmp MetaName buffer
     xmlBufferPtr           prop_buf;           // tmp Property buffer
     xmlChar               *tag;                // current tag name
     swish_DocInfo         *docinfo;            // document-specific properties
     boolean                no_index;           // toggle flag. should buffer be indexed.
     boolean                is_html;            // shortcut flag for html parser
     boolean                bump_word;          // boolean for moving word position/adding space
     unsigned int           offset;             // current offset position
     swish_TagStack        *metastack;          // stacks for tracking the tag => metaname
     swish_TagStack        *propstack;          // stacks for tracking the tag => property
     swish_TagStack        *domstack;           // stacks for tracking xml/html dom tree
     xmlParserCtxtPtr       ctxt;               // so we can free at end
     swish_TokenIterator   *token_iterator;     // token container
     swish_NamedBuffer     *properties;         // buffer all properties
     swish_NamedBuffer     *metanames;          // buffer all metanames
 };
 
 
Global Functions
 void            swish_setup();
 const char *    swish_lib_version();
 const char *    swish_libxml2_version();
 
Top-Level Functions
 swish_3 *       swish_3_init( void (*handler) (swish_ParserData *), void *stash );
 void            swish_3_free( swish_3 *s3 );
 int             swish_parse_file( swish_3 * s3, xmlChar *filename);
 unsigned int    swish_parse_fh( swish_3 * s3, FILE * fh);
 int             swish_parse_buffer( swish_3 * s3, xmlChar * buf);
 unsigned int    swish_parse_directory( swish_3 *s3, xmlChar *dir, boolean follow_symlinks );
 
I/O Functions
 xmlChar *   swish_io_slurp_fh( FILE * fh, unsigned long flen, boolean binmode );
 xmlChar *   swish_io_slurp_file_len( xmlChar *filename, off_t flen, boolean binmode );
 xmlChar *   swish_io_slurp_gzfile_len( xmlChar *filename, off_t flen, boolean binmode );
 xmlChar *   swish_io_slurp_file( xmlChar *filename, off_t flen, boolean is_gzipped, boolean binmode );
 long int    swish_io_count_operable_file_lines( xmlChar *filename );
 boolean     swish_io_is_skippable_line( xmlChar *str );
 
Filesystem Functions
 boolean     swish_fs_file_exists( xmlChar *filename );
 boolean     swish_fs_is_dir( xmlChar *path );
 boolean     swish_fs_is_file( xmlChar *path );
 boolean     swish_fs_is_link( xmlChar *path );
 off_t       swish_fs_get_file_size( xmlChar *path );
 time_t      swish_fs_get_file_mtime( xmlChar *path );
 xmlChar *   swish_fs_get_file_ext( xmlChar *url );
 boolean     swish_fs_looks_like_gz( xmlChar *file );
 
Hash Functions
 int         swish_hash_add( xmlHashTablePtr hash, xmlChar *key, void * value );
 int         swish_hash_replace( xmlHashTablePtr hash, xmlChar *key, void *value );
 int         swish_hash_delete( xmlHashTablePtr hash, xmlChar *key );
 boolean     swish_hash_exists( xmlHashTablePtr hash, xmlChar *key );
 int         swish_hash_exists_or_add( xmlHashTablePtr hash, xmlChar *key, xmlChar *value );
 void        swish_hash_merge( xmlHashTablePtr hash1, xmlHashTablePtr hash2 );
 void *      swish_hash_fetch( xmlHashTablePtr hash, xmlChar *key );
 void        swish_hash_dump( xmlHashTablePtr hash, const char *label );
 xmlHashTablePtr swish_hash_init(int size);
 void        swish_hash_free( xmlHashTablePtr hash );
 
Memory Functions
 void        swish_mem_init();
 void *      swish_xrealloc(void *ptr, size_t size);
 void *      swish_xmalloc( size_t size );
 void        swish_xfree( void *ptr );
 void        swish_mem_debug();
 long int    swish_memcount_get();
 void        swish_memcount_dec();
 xmlChar *   swish_xstrdup( const xmlChar * ptr );
 xmlChar *   swish_xstrndup( const xmlChar * ptr, int len );
 
Time Functions
 double      swish_time_elapsed(void);
 double      swish_time_cpu(void);
 char *      swish_time_print(double time);
 char *      swish_time_print_fine(double time);
 char *      swish_time_format(time_t epoch);
 
Error Functions
 void        swish_set_error_handle( FILE *where );
 void        swish_croak(const char *file, int line, const char *func, const char *msg,...);
 void        swish_warn(const char *file, int line, const char *func, const char *msg,...);
 void        swish_debug(const char *file, int line, const char *func, const char *msg,...);
 
String Functions
 char *              swish_get_locale();
 void                swish_verify_utf8_locale();
 boolean             swish_is_ascii( xmlChar *str );
 int                 swish_bytes_in_wchar( int wchar );
 int                 swish_utf8_chr_len( xmlChar *utf8 );
 uint32_t            swish_utf8_codepoint( xmlChar *utf8 );
 int                 swish_utf8_num_chrs( xmlChar *utf8 );
 void                swish_utf8_next_chr( xmlChar *s, int *i );
 void                swish_utf8_prev_chr( xmlChar *s, int *i );
 xmlChar *           swish_str_escape_utf8( xmlChar *utf8 );
 xmlChar *           swish_str_unescape_utf8( xmlChar *ascii );
 wchar_t *           swish_locale_to_wchar(xmlChar * str);
 xmlChar *           swish_wchar_to_locale(wchar_t * str);
 wchar_t *           swish_wstr_tolower(wchar_t *s);
 xmlChar *           swish_str_tolower(xmlChar *s );
 xmlChar *           swish_utf8_str_tolower(xmlChar *s);
 xmlChar *           swish_ascii_str_tolower(xmlChar *s);
 xmlChar *           swish_str_skip_ws(xmlChar *s);
 void                swish_str_trim_ws(xmlChar *string);
 void                swish_str_ctrl_to_ws(xmlChar *s);
 boolean             swish_str_all_ws(xmlChar * s);
 boolean             swish_str_all_ws_len(xmlChar * s, int len);
 void                swish_debug_wchars( const wchar_t * widechars );
 int                 swish_wchar_t_comp(const void *s1, const void *s2);
 int                 swish_sort_wchar(wchar_t *s);
 swish_StringList *  swish_stringlist_build(xmlChar *line);
 swish_StringList *  swish_stringlist_init();
 void                swish_stringlist_free(swish_StringList *sl);
 unsigned int        swish_stringlist_add_string(swish_StringList *sl, xmlChar *str);
 void                swish_stringlist_merge(swish_StringList *sl1, swish_StringList *sl2);
 swish_StringList *  swish_stringlist_copy(swish_StringList *sl);
 swish_StringList *  swish_stringlist_parse_sort_string(xmlChar *sort_string, swish_Config *cfg);
 void                swish_stringlist_debug(swish_StringList *sl);
 int                 swish_string_to_int( char *buf );
 boolean             swish_string_to_boolean( char *buf );
 xmlChar *           swish_int_to_string( int val );
 xmlChar *           swish_long_to_string( long val );
 xmlChar *           swish_double_to_string( double val );
 xmlChar *           swish_date_to_string( int y, int m, int d );
 char                swish_get_C_escaped_char(xmlChar *s, xmlChar **se);
 
Configuration Functions
 swish_Config *      swish_config_init();
 void                swish_config_set_default( swish_Config *config );
 void                swish_config_merge( swish_Config *config1, swish_Config *config2 );
 swish_Config *      swish_config_add( swish_Config * config, xmlChar * conf );
 swish_Config *      swish_config_parse( swish_Config * config, xmlChar * conf );
 void                swish_config_debug( swish_Config * config );
 void                swish_config_free( swish_Config * config);
 xmlHashTablePtr     swish_mime_defaults();
 xmlChar *           swish_mime_get_type( swish_Config * config, xmlChar * fileext );
 xmlChar *           swish_mime_get_parser( swish_Config * config, xmlChar *mime );
 swish_ConfigFlags * swish_config_flags_init();
 void                swish_config_flags_free( swish_ConfigFlags *flags );
 void                swish_config_test_alias_fors( swish_Config *config );
 void                swish_config_test_unique_ids( swish_Config *config );
 
 
Parser Functions
 swish_Parser *  swish_parser_init( void (*handler) (swish_ParserData *) );
 void            swish_parser_free( swish_Parser * parser );
 
Token Functions
 swish_TokenList *   swish_token_list_init();
 void                swish_token_list_free( swish_TokenList *tl );
 int                 swish_token_list_add_token(    
                                         swish_TokenList *tl, 
                                         xmlChar *token,
                                         int token_len,
                                         swish_MetaName *meta,
                                         xmlChar *context );
 int                 swish_token_list_set_token(
                                         swish_TokenList *tl,
                                         xmlChar *token,
                                         int len );
 swish_Token *       swish_token_init();
 void                swish_token_free( swish_Token *t );
 swish_TokenIterator *swish_token_iterator_init( swish_Analyzer *a );
 void                swish_token_iterator_free( swish_TokenIterator *ti );
 swish_Token *       swish_token_iterator_next_token( swish_TokenIterator *it );
 int                 swish_tokenize(     swish_TokenIterator *ti, 
                                         xmlChar *buf, 
                                         swish_MetaName *meta,
                                         xmlChar *context );
 int                 swish_tokenize_ascii(    
                                         swish_TokenIterator *ti, 
                                         xmlChar *buf, 
                                         swish_MetaName *meta,
                                         xmlChar *context );
 int                 swish_tokenize_utf8(    
                                         swish_TokenIterator *ti, 
                                         xmlChar *buf, 
                                         swish_MetaName *meta,
                                         xmlChar *context );
 void                swish_token_list_debug( swish_TokenIterator *it );
 xmlChar *           swish_token_list_get_token_value( swish_TokenList *tl, swish_Token *t );
 void                swish_token_debug( swish_Token *t );
 
 
Analyzer Functions
 swish_Analyzer *    swish_analyzer_init( swish_Config * config );
 void                swish_analyzer_free( swish_Analyzer * analyzer );
 
DocInfo Functions
 swish_DocInfo *     swish_docinfo_init();
 void                swish_docinfo_free( swish_DocInfo * ptr );
 int                 swish_docinfo_check(swish_DocInfo * docinfo, swish_Config * config);
 int                 swish_docinfo_from_filesystem(  xmlChar *filename, 
                                                     swish_DocInfo * i, 
                                                     swish_ParserData *parser_data );
 void                swish_docinfo_debug( swish_DocInfo * docinfo );
 
Buffer Functions
 swish_NamedBuffer * swish_nb_init( xmlHashTablePtr confhash );
 void                swish_nb_free( swish_NamedBuffer * nb );
 void                swish_nb_debug( swish_NamedBuffer * nb, xmlChar * label );
 void                swish_nb_add_buf( swish_NamedBuffer *nb, 
                                       xmlChar * name,
                                       xmlBufferPtr buf, 
                                       xmlChar * joiner,
                                       boolean cleanwsp,
                                       boolean autovivify);
 void                swish_nb_add_str(   swish_NamedBuffer * nb, 
                                         xmlChar * name, 
                                         xmlChar * str,
                                         unsigned int len,
                                         xmlChar * joiner,
                                         boolean cleanwsp,
                                         boolean autovivify);
 void                swish_buffer_append( xmlBufferPtr buf, xmlChar * txt, int len );
 xmlChar*            swish_nb_get_value( swish_NamedBuffer* nb, xmlChar* key );
 
Property Functions
 swish_Property *    swish_property_init( xmlChar *propname );
 void                swish_property_free( swish_Property *prop );
 void                swish_property_debug( swish_Property *prop );
 int                 swish_property_get_id( xmlChar *propname, xmlHashTablePtr properties );
 
MetaName Functions
 swish_MetaName *    swish_metaname_init( xmlChar *name);
 void                swish_metaname_free( swish_MetaName *m );
 void                swish_metaname_debug( swish_MetaName *m );
 
Header Functions
 boolean             swish_header_validate(char *filename);
 boolean             swish_header_merge(char *filename, swish_Config *c);
 swish_Config *      swish_header_read(char *filename);
 void                swish_header_write(char* filename, swish_Config* config);
 
DESCRIPTION

libswish3 is the core C library of Swish3.

libswish3 uses the GNOME Libxml2 library to parse words and metadata from XML, HTML and plain text files. libswish3 supports full UTF-8 encoding.

libswish3 is a parsing tool for use with information retrieval (IR) libraries. Dynamic language bindings are available in the source distribution in the bindings directory.

APIs

The following APIs are defined:

Parsing API

libswish3 provides three basic input functions:

  • swish_parse_file()

  • swish_parse_fh()

  • swish_parse_buffer()

Each of these functions takes a swish_3 struct pointer as its first argument (see the SYNOPSIS).

In addition:

  • The swish_parse_file() function takes a file path, which must be a valid file. Directories and links are not supported. The assumption is that you will use your calling code to recurse through directories and handle links.

  • swish_parse_buffer() takes a string representing the document headers and the full text of the document.

  • swish_parse_fh() takes a filehandle pointer, which if set to NULL, defaults to stdin.

See the Headers API section for more information on using swish_parse_fh() and swish_parse_buffer().

See The handler Function section for more information on how to deal with the data extracted by each of the swish_parse_* functions.
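Since swish_parse_file() expects a plain file, the calling code has to do the directory recursion. The following is a minimal sketch of such a walker using POSIX opendir() and lstat(); walk_files(), file_cb and count_file() are hypothetical names for this example and are not part of libswish3. In real use the callback is where you would call swish_parse_file().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for lstat() under strict C modes */
#endif
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Called once per regular file found. */
typedef int (*file_cb)(const char *path, void *user_data);

/*
 * Recursively visit every regular file under dir, calling cb(path, user_data).
 * lstat() is used so symlinks are never followed; they are simply skipped.
 * Returns the number of files visited, or -1 if dir cannot be opened.
 */
long walk_files(const char *dir, file_cb cb, void *user_data)
{
    assert(dir != NULL && cb != NULL);

    DIR *dh = opendir(dir);
    if (dh == NULL)
        return -1;

    long count = 0;
    struct dirent *ent;
    while ((ent = readdir(dh)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        char path[4096];
        int n = snprintf(path, sizeof path, "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;                   /* path too long: skip it */

        struct stat st;
        if (lstat(path, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode)) {
            long sub = walk_files(path, cb, user_data);
            if (sub > 0)
                count += sub;
        }
        else if (S_ISREG(st.st_mode)) {
            cb(path, user_data);
            count++;
        }
        /* symlinks, sockets, devices, etc. are ignored */
    }
    closedir(dh);
    return count;
}

/*
 * Example callback: count the files seen. An indexer would instead call
 * swish_parse_file( s3, (xmlChar *)path ) here.
 */
int count_file(const char *path, void *user_data)
{
    (void)path;
    long *seen = user_data;
    (*seen)++;
    return 0;
}
```

Skipping symlinks outright is the conservative choice. If you want to follow them, use stat() instead of lstat() and guard against cycles.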

Headers API

The Headers API supports and extends the Swish-e -S prog feature, which allows you to feed the indexer with output from another program. The API has been extended from Swish-e's to allow for MIME types and more congruence with the HTTP 1.1 specification.

See SWISH-RUN documentation in the Swish-e distribution for the Swish-e version 2 headers API.

This is the libswish3 implementation. See SWISH::Prog::Headers for a simple Perl-based way of generating the proper headers.

  • Content-Location

    Swish-e name: Path-Name

    The name of the document. May be any string: an ID of a record in a database, a URL or a simple file name. The string is stored in the swish_DocInfo uri struct member, which is often used as the primary identifier of a document in an index.

    This header is required.

  • Content-Length

    The length in bytes of the document, starting after the blank line separating the headers from the document itself. The value must be exactly the length of the document, including any extra line feeds or carriage returns at the end of the document.

    Example:

     Content-Location: foo.html
     Content-Length: 9

     The doc.\n
     ^^^^^^^^ ^
     12345678 9

    The value is stored in the swish_DocInfo size struct member.

    This header is required.

  • Last-Modified

    Swish-e name: Last-Mtime

    The last modification time of the document. The value must be an integer: the seconds since the Epoch on your system.

    If not present, the value defaults to the current time.

    The value is stored in the swish_DocInfo mtime struct member.

    This header is not required.

  • Parser-Type

    Swish-e name: Document-Type

    Explicitly name the parser used for the document, rather than defaulting to the MIME type mapping based on Content-Type and/or Content-Location. The three parser types are:

    • XML

    • HTML

    • TXT

    The Swish-e values XML2, XML*, HTML2, HTML*, TXT2 and TXT* are also supported for compatibility, but they map to the three libswish3 types.

    The value is stored in the swish_DocInfo parser struct member.

    If not present, the document parser will be automatically chosen based on the following logic:

    • If a Content-Type is given, the parser mapped to that MIME type will be used. You may override the default mappings in your configuration. See Configuration API.

    • If no Content-Type is given, a MIME type will be guessed at based on the file extension of the document's Content-Location, and the parser mapped to that MIME type will be used.

    • Finally, if a MIME type is not identified, the parser defined in SWISH_DEFAULT_PARSER in libswish3.h will be used.

    See also Content-Type and Content-Location.

    This header is not required.

  • Content-Type

    The MIME type of the document. The libswish3 MIME type list is based on the Apache 2.0 version. See http://www.iana.org/assignments/media-types/ for the official registry.

    If not defined with Content-Type, the MIME type will be guessed based on the file extension in the Content-Location header. If the Content-Location string does not contain a file extension (as might be the case with a non-URL value), or the file extension has no MIME mapping, then the MIME type will default to SWISH_DEFAULT_MIME as defined in libswish3.h.

    You may override the default extension-to-MIME mappings in your configuration. See Configuration API.

    The value is stored in the swish_DocInfo mime struct member.

    See also Content-Location and Parser-Type.

    This header is not required.

  • Update-Mode

    NOTE: This header exists only for backwards compatibility with Swish-e's incremental index feature. It may be deprecated in a future version of libswish3.
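If you are generating documents from C rather than Perl, the two required headers can be assembled with a few lines of code. This is a minimal sketch under stated assumptions: build_headers() is a hypothetical helper, not part of libswish3; it assumes plain "\n" line endings and a NUL-terminated text body, and it emits only Content-Location and Content-Length.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the Content-Location and Content-Length headers, the separating
 * blank line, and the document body into out.
 * Content-Length is the byte length of body, so body must already include
 * any trailing newline you want counted.
 * Returns the total bytes written (excluding the NUL), or -1 if out is
 * too small. build_headers() is a hypothetical helper, not part of libswish3.
 */
int build_headers(const char *uri, const char *body, char *out, size_t outlen)
{
    assert(uri != NULL && body != NULL && out != NULL);

    size_t body_len = strlen(body);
    int n = snprintf(out, outlen,
                     "Content-Location: %s\n"
                     "Content-Length: %zu\n"
                     "\n"
                     "%s",
                     uri, body_len, body);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return n;
}
```

With uri "foo.html" and body "The doc.\n" this reproduces the Content-Length example above, with a length of 9. The resulting string is the kind of input swish_parse_buffer() is described as taking, or it can be written to a pipe read by swish_parse_fh(). Because strlen() is used, a body containing NUL bytes would need an explicit length argument instead.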

Structures API

Writing an effective handler function requires an understanding of some of the key libswish3 data structures.

For more details on any of these structures, see the SYNOPSIS.

swish_3

The main data structure. A swish_3 object has a swish_Config, swish_Analyzer and swish_Parser object and delegates to each as appropriate.

This is typically the only object you need to create and use.

swish_Config

A configuration object. This object is required for initializing both a swish_Analyzer object and a swish_Parser object.

swish_Parser

A parser object. Required for executing any of the three swish_parse_* functions.

swish_ParserData

A parser data object. This object is passed around internally by the libxml2 SAX2 handlers, and is eventually the object passed to the handler function pointer. See The handler Function.

swish_TokenList

A list of words or tokens. The object is typically accessed via a swish_TokenIterator, like this:

    // example of swish_TokenIterator
    swish_Token *t;
    while ((t = swish_token_iterator_next_token(token_iterator)) != NULL) {
        SWISH_DEBUG_MSG("\n\
        t->ref_cnt      = %d\n\
        t->pos          = %u\n\
        t->context      = %s\n\
        t->meta         = %d [%s]\n\
        t->len          = %u\n\
        t->value        = %s\n\
    ", t->ref_cnt, t->pos, t->context, t->meta->id, t->meta->name, t->len, t->value);
    }

See the swish_token_list_debug() function, from which the code above is taken.

swish_Token

An object representing one word or token. The word's position relative to other words, length, tag context and MetaName are all available in the object.

swish_DocInfo

An object describing metadata about the document itself: URI, MIME type, size, etc.

swish_Analyzer

The Analyzer object controls how the character content of a document is parsed: whether or not a TokenList is created with a tokenizer, whether the words (tokens) are lowercased or stemmed, etc.
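To make the effect of those settings concrete, here is a small ASCII-only tokenizer sketch. It is not the libswish3 tokenizer: analyzer_opts, tokenize_ascii() and join_token() are hypothetical names, the struct only mirrors the minwordlen, maxwordlen and lc fields of swish_Analyzer, and splitting on non-alphanumeric bytes is an assumption made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical settings mirroring the swish_Analyzer fields of the same names. */
typedef struct {
    unsigned int minwordlen;    /* shortest token to keep */
    unsigned int maxwordlen;    /* longest token to keep */
    int          lc;            /* nonzero: lowercase each token */
} analyzer_opts;

typedef void (*token_cb)(const char *token, unsigned int pos, void *user_data);

/*
 * Split ASCII text on non-alphanumeric bytes and call cb for each token
 * that passes the length limits. pos counts emitted tokens, starting at 1.
 * Returns the number of tokens emitted.
 */
unsigned int tokenize_ascii(const char *text, const analyzer_opts *opts,
                            token_cb cb, void *user_data)
{
    assert(text != NULL && opts != NULL && cb != NULL);

    char word[256];
    unsigned int emitted = 0;
    const char *p = text;

    while (*p != '\0') {
        /* skip separators, then scan one run of alphanumerics */
        while (*p != '\0' && !isalnum((unsigned char)*p))
            p++;
        const char *start = p;
        while (*p != '\0' && isalnum((unsigned char)*p))
            p++;

        size_t len = (size_t)(p - start);
        if (len == 0 || len < opts->minwordlen || len > opts->maxwordlen
            || len >= sizeof word)
            continue;

        for (size_t i = 0; i < len; i++)
            word[i] = opts->lc ? (char)tolower((unsigned char)start[i])
                               : start[i];
        word[len] = '\0';
        cb(word, ++emitted, user_data);
    }
    return emitted;
}

/* Example callback: append each token to a space-separated string.
 * The caller must supply a buffer large enough for all tokens. */
void join_token(const char *token, unsigned int pos, void *user_data)
{
    char *joined = user_data;
    if (pos > 1)
        strcat(joined, " ");
    strcat(joined, token);
}
```

With minwordlen set to 2 and lc enabled, the one-letter word "a" is dropped and "UTF8" comes out as "utf8". Stemming and UTF-8 handling are left out of this sketch entirely.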

The handler Function

The handler function pointer is the final link in the parsing chain. The function pointer is set in the swish_Parser object constructor, and is called by each of the swish_parse_* functions after the entire document has been parsed and (optionally) tokenized.

The handler receives one argument: a swish_ParserData object containing all the metadata and words in the document.

If all you wanted to do was print out a report about each document as it was parsed, your handler function might be as simple as:

 void
 my_handler( swish_ParserData * parse_data )
 {
    swish_docinfo_debug( parse_data->docinfo );
    swish_token_list_debug( parse_data->token_iterator );
    swish_nb_debug( parse_data->properties, "Property" );
    swish_nb_debug( parse_data->metanames, "MetaName" );
 }

IMPORTANT: After the handler function is called, all the structures referenced by the swish_ParserData object are automatically freed, so if you intend to keep any of the data for storing in an index, you will need to strdup() words, properties, docinfo, etc. as part of your indexing code.
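The copying step can be sketched as follows. saved_doc, saved_doc_new() and saved_doc_free() are hypothetical names for this example and are not part of libswish3; the sketch takes plain C strings so that it stands alone, and uses malloc() plus memcpy() rather than strdup(), which is POSIX rather than standard C.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record owned by the indexing code, not by libswish3. */
typedef struct {
    char *uri;
    char *mime;
    long  size;
} saved_doc;

/* Duplicate a NUL-terminated string with malloc(); returns NULL on failure. */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/*
 * Make private copies of the values we want to keep.
 * Inside a handler this might be called as:
 *   saved_doc_new( (const char *)parse_data->docinfo->uri,
 *                  (const char *)parse_data->docinfo->mime,
 *                  (long)parse_data->docinfo->size );
 */
saved_doc *saved_doc_new(const char *uri, const char *mime, long size)
{
    assert(uri != NULL && mime != NULL);

    saved_doc *doc = malloc(sizeof *doc);
    if (doc == NULL)
        return NULL;

    doc->uri  = copy_string(uri);
    doc->mime = copy_string(mime);
    doc->size = size;
    if (doc->uri == NULL || doc->mime == NULL) {
        free(doc->uri);
        free(doc->mime);
        free(doc);
        return NULL;
    }
    return doc;
}

/* Release a record created by saved_doc_new(). */
void saved_doc_free(saved_doc *doc)
{
    if (doc == NULL)
        return;
    free(doc->uri);
    free(doc->mime);
    free(doc);
}
```

The record returned by saved_doc_new() stays valid after the handler returns, because none of its pointers refer to memory owned by the swish_ParserData object.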

See the example swish_lint.c file for how to create and pass in a handler function pointer to the swish_3_init() constructor.

Configuration API

Configuration is different with libswish3 than with Swish-e. The biggest change is that libswish3 configuration files are written in XML. This is done for several reasons:

  1. Since libswish3 already has a powerful XML parser built in, it's much easier to parse a configuration file written in XML than to port the Swish-e config parser to libswish3.

  2. libswish3 stores index header information in an XML format nearly identical to the configuration file format, so the parser needs to understand only one XML schema.

  3. You can store UTF-8 text in your configuration file and it will be parsed correctly.

  4. The configuration directive list is extensible. Simple key/value configuration directives can be added without any modification to the libswish3 config parser. They are simply stored in the misc hash of the swish_Config struct for your own use and amusement.

CAUTION: The configuration directive names documented in the Directives section below are reserved for use by libswish3. Some of them have special handling considerations (like MetaNames and PropertyNames). So the important idea to grasp with the extensible configuration feature is "simple key/value pairs."

This section describes how to build a libswish3 configuration file.

Configuration Example

Here's an example libswish3 configuration file:

 <swish>
  <FollowSymLinks>yes</FollowSymLinks>
  
  <MetaNames>
   <foo bias="+10" />
   <bar bias="-5" />
   <swishtitle bias="+50" />
   <title alias_for="swishtitle" />
   <other>color size weight</other>
  </MetaNames>
  
  <PropertyNames>
   <foo type="text" ignore_case="1" />
   <bar type="int" />
   <lastmod type="date" />
   <bing ignore_case="0" />
   <description verbatim="1" max="10000" alias="body" sort_length="20" />
   <notsorted sort="0" />
  </PropertyNames>
  
  <Tokenize>1</Tokenize>
 </swish>

And here's that same example, dissected:

 <swish>

The top level tag.

 <FollowSymLinks>yes</FollowSymLinks>

Equivalent to the Swish-e style:

 FollowSymLinks yes

which simply informs whatever aggregator you are using that when confronted with a symlink on the filesystem, it should be followed.

FollowSymLinks is an example of a simple key/value pair (see the CAUTION above).

MetaNames

Here's the first big difference from Swish-e. MetaNames, MetaNameAlias, and MetaNamesRank have been combined into a single XML tag with appropriate attributes.

 <foo bias="+10" />

is the same thing as (in Swish-e style):

 MetaNames foo
 MetaNamesRank 10 foo

while:

 <swishtitle bias="+50" />
 <title alias_for="swishtitle" />

is equivalent to:

 MetaNames swishtitle
 MetaNameAlias swishtitle title
 MetaNamesRank 50 swishtitle

You can see that the XML style allows for a terser, more compact expression. You can still assign multiple aliases to a single MetaName:

 <other>color size weight</other>

is equivalent to:

 MetaNames other
 MetaNameAlias other color size weight

In addition, there are some special features intended for use with HTML documents.

 <links html="1" alias_for="href" />      # same as HTMLLinksMetaName
 <images html="1" alias_for="src" />      # same as ImageLinksMetaName
 <alttext html="1" alias_for="alt" />     # same as IndexAltTagMetaName

PropertyNames

PropertyNames, PropertyNamesCompareCase, PropertyNamesIgnoreCase, PropertyNamesNoStripChars, PropertyNamesNumeric, PropertyNamesDate, PropertyNameAlias, PropertyNamesMaxLength, PropertyNamesSortKeyLength, StoreDescription and PreSortedIndex have all been combined into a single XML directive.

Here's the example from above with equivalent Swish-e directives annotated:

 <foo ignore_case="1" />
 # PropertyNamesIgnoreCase foo
 <bar type="int" />
 # PropertyNamesNumeric bar
 
 <lastmod type="date" />
 # PropertyNamesDate lastmod
 
 <bing comparecase="1" />
 # PropertyNamesCompareCase bing
 
 <description verbatim="1" max="10000" alias="body" sort_length="20" />
 # PropertyNamesNoStripChars description
 # PropertyNamesMaxLength 10000 description
 # PropertyNameAlias description body
 # PropertyNamesSortKeyLength 20 description
 <notsorted sort="0" />
 # PreSortedIndex foo bar lastmod bing description

Again, the XML format greatly simplifies the syntax. You can assign attributes as you need, though be aware that some attributes are inherently mismatched and might generate an error or unexpected behaviour:

 <foo ignore_case="1" type="int" />     # wrong
 <foo ignore_case="0" type="date" />    # wrong
 <foo verbatim="1" type="int" />        # wrong
 <foo sort="0" max="20" />              # wrong
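
A quick way to catch these combinations before indexing is a small lint pass over each tag. The sketch below is a guess at the rules implied by the four "wrong" examples above (case and verbatim handling only make sense for string properties, and sort="0" clashes with max); libswish3 itself may apply different or additional checks, and the name find_mismatches is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: "int" and "date" are the non-string property types.
NON_STRING_TYPES = {"int", "date"}


def find_mismatches(tag_xml):
    """Return a list of attribute conflicts found in one PropertyName tag."""
    attrs = ET.fromstring(tag_xml).attrib
    problems = []
    typed = attrs.get("type") in NON_STRING_TYPES

    # Case handling is only meaningful for string values.
    if typed and "ignore_case" in attrs:
        problems.append("ignore_case with type=" + attrs["type"])
    # Verbatim storage is only meaningful for string values.
    if typed and attrs.get("verbatim") == "1":
        problems.append("verbatim with type=" + attrs["type"])
    # Taken directly from the last "wrong" example above.
    if attrs.get("sort") == "0" and "max" in attrs:
        problems.append("sort=0 with max")
    return problems


if __name__ == "__main__":
    print(find_mismatches('<foo ignore_case="1" type="int" />'))
    print(find_mismatches('<bar type="int" />'))
```
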

Directives

The following configuration directives are currently reserved by libswish3. These are top-level tags within the root tag. Any top-level tags other than these are treated as key/value pairs and added to the misc hash in the swish_Config struct.

  • swish_version

    Contains the value of the SWISH_VERSION constant.

  • swish_lib_version

    Contains the value of the SWISH_LIB_VERSION constant.

  • MetaNames

    Contains MetaName definitions.

  • PropertyNames

    Contains PropertyName definitions.

  • Parsers

    Contains a mapping of Parser name to MIME types. Example:

     <Parsers>
      <XML>application/xml</XML>
      <HTML>text/html</HTML>
      <TXT>text/plain</TXT>
      <XML>text/foo</XML>
      <HTML>default</HTML>
      <HTML>foo/bar</HTML>
     </Parsers>

  • MIME

    Contains a mapping of file extensions to MIME types. Example:

     <MIME>
      <au>foo/bar</au>
     </MIME>

  • Index

    Contains attributes of the inverted index. Example (with defaults):

     <Index>
      <Format>Native</Format>
      <Name>index.swish</Name>
      <Locale>en_US.UTF-8</Locale>
     </Index>

  • TagAlias

    Contains mapping of alias names to tag names.

     <TagAlias>
      <swishdescription>body</swishdescription>
      <swishtitle>title</swishtitle>
      <swishtitle>foo</swishtitle>
      <swishtitle>bar</swishtitle>
     </TagAlias>

  • Tokenize

    Toggle the tokenizer on or off. Default is yes (on).

     <Tokenize>yes</Tokenize>
     
  • CascadeMetaContext

    Toggle the cascading effect of MetaName context. The default is no (off).

     <CascadeMetaContext>no</CascadeMetaContext>

    When enabled, a token is treated as belonging to every MetaName in its DOM stack, not just the innermost one. An example:

     <doc>
      <metaone>
       foo
       <metatwo>bar</metatwo>
      </metaone>
     </doc>

    If CascadeMetaContext is true (on), then the token bar will be tracked under both MetaNames, metaone and metatwo. If CascadeMetaContext is false (off), then the token bar will be tracked only under metatwo.
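
The cascading behaviour can be illustrated with a short Python sketch. It is not how libswish3 is implemented (that is C on top of libxml2); it only mimics the rule described above. The function name assign_metanames is hypothetical, and tokens outside any MetaName simply get an empty list here, which is a simplification.

```python
import xml.etree.ElementTree as ET

EXAMPLE_DOC = """
<doc>
 <metaone>
  foo
  <metatwo>bar</metatwo>
 </metaone>
</doc>
"""


def assign_metanames(xml_text, metanames, cascade):
    """Return (token, [MetaNames]) pairs for every word in the document."""
    results = []

    def emit(text, stack):
        # Split on whitespace; a real tokenizer would be more careful.
        for word in (text or "").split():
            if cascade:
                names = list(stack)              # every MetaName in the stack
            else:
                names = stack[-1:]               # innermost MetaName only
            results.append((word, names))

    def walk(elem, stack):
        if elem.tag in metanames:
            stack = stack + [elem.tag]           # push without mutating caller
        emit(elem.text, stack)
        for child in elem:
            walk(child, stack)
            emit(child.tail, stack)              # text following the child

    walk(ET.fromstring(xml_text), [])
    return results


if __name__ == "__main__":
    names = {"metaone", "metatwo"}
    print(assign_metanames(EXAMPLE_DOC, names, cascade=True))
    print(assign_metanames(EXAMPLE_DOC, names, cascade=False))
```

With cascading on, bar is reported under both metaone and metatwo; with it off, only under metatwo. The token foo is under metaone either way.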

EXAMPLES

See the swish_lint.c file included in the libswish3 distribution.

FAQ

What is IR?

Information Retrieval.

How is libswish3 related to Swish-e?

libswish3 is the core parsing library for Swish-e version 3 (Swish3).

Is libswish3 a search engine?

No. libswish3 is a document parser. It might work well in or with any number of search engines, but it is not in itself any kind of search tool.

So what does libswish3 DO exactly?

libswish3 reads text, HTML and XML files and extracts just the words and document properties from each document. It then hands off the token list and properties to a handler function. Finally, it frees all the memory associated with the token list and properties.

The handler function can do whatever you wish, though typically a handler would iterate over the words in the token list and add each one to an index using an IR library API.
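
The parse, hand off, free cycle looks roughly like the Python sketch below. None of these names come from the libswish3 API; parse_document and add_to_index are hypothetical stand-ins, and clearing the list stands in for freeing memory. The point is the ownership rule: the handler must copy out whatever it wants to keep, because the token list is gone once it returns.

```python
import re


def parse_document(text, properties, handler):
    """Tokenize text, hand the tokens to handler, then release them."""
    tokens = re.findall(r"\w+", text.lower())
    handler(tokens, properties)
    tokens.clear()  # stands in for freeing the token list


# A toy "IR library": word -> list of (document URI, position).
inverted_index = {}


def add_to_index(tokens, properties):
    """Example handler: copy each word into the index before it is freed."""
    for position, word in enumerate(tokens):
        inverted_index.setdefault(word, []).append((properties["uri"], position))


parse_document("Hello hello world", {"uri": "doc1"}, add_to_index)

if __name__ == "__main__":
    print(inverted_index)
```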

BACKGROUND

libswish3 is part of the Swish-e project. It was born out of the need for UTF-8 and incremental indexing support and a desire to experiment with alternate indexing libraries like Lucene, KinoSearch, Xapian and Hyperestraier.

libswish3 was developed with the idea that many quality IR libraries already exist, but few if any provide an easy and fast way of preparing documents for indexing. The following assumptions informed the development of libswish3.

The IR Toolchain

A decent IR toolchain requires five parts:

  • aggregator

    Collects documents from a filesystem, database, website or other sources.

  • filter

    Normalizes documents to a standard format (plain text, or a delimited or markup format such as YAML, HTML or XML) for indexing.

  • parser

    Breaks a document into a list of words, including their context and position.

  • indexer

    Writes the list of words in a storage system for quick, efficient retrieval.

  • searcher

    Parses queries and fetches data from the indexer's storage system.

Of course, the division between these parts is not always clean or apparent. Parsing search queries, for example, involves elements of both the parser and searcher components, while the indexer and searcher are intrinsically bound to each other.

But any complete IR system will have these five parts in some combination.
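
To make the five parts concrete, here is a toy toolchain in Python with each stage as one function. Everything in it is an assumption made for illustration: the documents live in a dict, the filter is a crude tag stripper, and the index is an in-memory dict. Note that search reuses parse, which is the blurry parser/searcher boundary mentioned above.

```python
import re
from collections import defaultdict


def aggregate(source):
    """Aggregator: collect (doc_id, raw content) pairs from a source."""
    yield from source.items()


def filter_doc(raw):
    """Filter: normalize to plain text by stripping anything tag-like."""
    return re.sub(r"<[^>]+>", " ", raw)


def parse(text):
    """Parser: break text into (word, position) pairs."""
    return [(word.lower(), pos)
            for pos, word in enumerate(re.findall(r"\w+", text))]


def index(docs):
    """Indexer: store each word with the documents and positions it occurs at."""
    inverted = defaultdict(list)
    for doc_id, raw in docs:
        for word, pos in parse(filter_doc(raw)):
            inverted[word].append((doc_id, pos))
    return inverted


def search(inverted, query):
    """Searcher: return ids of documents containing every query word."""
    doc_sets = [{doc_id for doc_id, _ in inverted.get(word, [])}
                for word, _ in parse(query)]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


if __name__ == "__main__":
    docs = {"a.html": "<p>Hello world</p>", "b.txt": "hello there"}
    inverted = index(aggregate(docs))
    print(search(inverted, "hello"))
    print(search(inverted, "hello world"))
```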

Swish-e aggregators and filters are already good

The existing Swish-e document aggregators (DirTree.pl and spider.pl) and filtering system (SWISH::Filter) are good. They are all written in Perl and are easily modified, and they have ample configuration options and documentation.

Why reinvent the wheel?

Several good IR libraries exist that provide an indexer and searcher. These libraries support UTF-8 and incremental indexing, and have search syntax on par with (or better than) Swish-e 2.x. Examples include Xapian, KinoSearch and Lucene. While they might be a little slower than Swish-e (at least in terms of indexing speed), they make up for that with:

  • well-documented APIs

  • bindings in a variety of programming languages

  • active development communities

  • the flexibility that comes with being a library instead of a fixed program

The missing link

The piece that Swish-e provides that other IR libraries lack is a fast, stable, integrated document parser. Xapian has Omega, but it does not parse XML, nor does it recognize ad hoc word context (metanames).

However, the Swish-e 2.x parser does not work independently of the Swish-e indexer and searcher, nor does it support UTF-8.

One piece is missing: a parser that works with the Swish-e aggregator/filter system, supports UTF-8, and offers flexible options for connecting with other IR libraries.

Ergo, libswish3: a document parser compatible with the existing Swish-e -S prog API and capable of generating UTF-8 token lists for indexing with a variety of IR libraries.

Where does libswish3 fit?

libswish3 is the core C library in Swish3.

However, libswish3 may be used without the rest of Swish3. The assumption is that libswish3 could fit into an IR toolchain like this:

 aggregator -> filter -> libswish3 -> some IR library

You could then use the native search API of the IR library.

For example, you might use the Swish-e spider.pl script to spider a website, filtering documents with SWISH::Filter and then handing the output to a libswish3-based program that will parse the documents into words and store the data in a Xapian or KinoSearch index (or both!). That model is, in fact, what Swish3 does.

Or you might use the SWISH::Prog Perl module (from the CPAN) to build your own aggregator/filter system, then hand the output to libswish3.

AUTHOR

Peter Karman (peter@peknet.com).

CREDITS

libswish3 is inspired by code from Swish-e (http://www.swish-e.org), Libxml2 (http://www.xmlsoft.org), Apache (http://www.apache.org), Rahul Dhesi (http://www.tug.org/tex-archive/tools/zoo/), Angel Ortega (http://www.triptico.com/software/unicode.html), James Henstridge (http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html), YoLinux (http://www.yolinux.com/TUTORIALS/GnomeLibXml2.html) and no doubt many unnamed others.

All mistakes, errors and poor programming choices are, however, those of the author.

LICENSE

libswish3 is licensed under the GPL.

libswish3 is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

libswish3 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.

You should have received a copy of the GNU Library General Public License along with libswish3; see the file COPYING. If not, write to the

 Free Software Foundation, Inc.
 59 Temple Place - Suite 330
 Boston, MA 02111-1307, USA

SEE ALSO

The project homepage: http://dev.swish-e.org/wiki/swish3

swish_lint(1), swish_isw(1), swish_words(1)

The Vendor-Client Relationship

So I don't surf YouTube very much. Or rather, only when my kids want to watch Wallace and Gromit trailers. So I'm always waaaay behind the times. That said, this video is a riot.

Terminal Color

For the last ten years I have used the color #fddc8e (hex) as my terminal background color. It's a darkish amber color that is very easy on the eyes. I'm recording it here because every year or so I have to set up a new system and always have to eyeball the settings till I get something close to what I am used to.

Update: 26 Jan 2009 Here's my .Xdefaults file for my xterm under X11 on OS X.

XTerm*background: #fddc8e
XTerm*foreground: black
XTerm*faceName: monaco
XTerm*faceSize: 10
XTerm*saveLines: 10000
XTerm*scrollBar: true
XTerm*rightScrollBar: true
XTerm*jumpScroll: true
XTerm*geometry: 100x40+0+0

I like Plack

Plack is a Perl toolkit for PSGI web applications and servers, written by miyagawa.

CQL

Contextual Query Language is defined by the Library of Congress. I discovered it via CQL::Parser. Brian Cassidy is involved, so it must be good.

I immediately thought "oh shit. Now my new Search::Query module feels late-to-the-party." But on further reading, I think a CQL dialect in Search::Query makes some sense.

Search::Query is a SQL::Translator-like module for free-text search. I coded it up this week after brewing the idea for many months. I'm imagining it now as a next-generation Search::QueryParser::SQL, for contexts beyond SQL. Example: I have a query string that works with Xapian and want to convert it to one that works with Swish-e 2.x or KinoSearch. Just parse it with Search::Query::Parser and assign it a target dialect and then call $query->stringify to get the translated version out.
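
The parse-once, stringify-per-dialect idea is easy to show in miniature. The Python below is not Search::Query and shares none of its API; the field:value input syntax, the two output syntaxes and all the names are made up for the sake of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """One field/value pair from a parsed query."""
    field: str
    value: str


def parse_query(text):
    """Parse a whitespace-separated list of field:value terms."""
    clauses = []
    for term in text.split():
        field, _, value = term.partition(":")
        clauses.append(Clause(field, value))
    return clauses


def render_swish(clause):
    # Hypothetical Swish-like syntax: field=value
    return f"{clause.field}={clause.value}"


def render_sql(clause):
    # Hypothetical SQL syntax; single quotes are doubled to escape them.
    escaped = clause.value.replace("'", "''")
    return f"{clause.field}='{escaped}'"


DIALECTS = {"swish": render_swish, "sql": render_sql}


def stringify(clauses, dialect):
    """Render the same parsed query in the requested dialect."""
    render = DIALECTS[dialect]
    return " AND ".join(render(clause) for clause in clauses)


if __name__ == "__main__":
    query = parse_query("color:red size:large")
    print(stringify(query, "swish"))
    print(stringify(query, "sql"))
```

The parsed query is dialect-neutral, so adding another target means writing one more render function.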

Perl6 and Perl5

I know the people who read this blog generally do not care about Perl at all (hi Mom!) but I spend a great deal of time writing code in the language and talking with other members of the Perl community about our common projects, and so like anyone who has lived in the Perl world for any length of time, I have an opinion about Perl6. For those not in the know, Perl5 is the current version of Perl and has been around for over 10 years. Perl6 is the next major version evolution, but it has been in development for nearly the same length of time. The problem is that 10 years is a long time for a computer language release to gestate and many folks whose opinions count (i.e. managers) see that lack of a release as a sign that Perl Is Dead and not a good choice for their next programming project. So (the argument goes) Perl6's vaporware status makes it hard for Perl5 programmers to find jobs, because the "if it ain't new it ain't sexy" ethos of technology counts for more than it should with those making the money decisions.

The real problem isn't that Perl6 hasn't been released. The real problem is the name Perl6. Perl6 is not a single executable "thing" like Perl5 is; it's an umbrella for several different projects. Right now I can sit down at just about any modern Unix-like computer and type 'perl' and write some code that runs. Perl6 doesn't work quite that way. It's a whole new language, not just a major revision to an existing language. So the version number 5 vs 6 is misleading. That's the problem. Perl is alive and well. Perl5 continues to be maintained and developed. I get lots of work done every day using it.

Matt Trout writes a nice piece about this topic, aimed at the Perl community. I applaud it.

Question as Patch

Reading through Matt Trout's blog just now I found this wonderful quote:

Because in free software a question in the form of a well thought out patch is one that almost always gets a constructive answer.

Yes. That's just it. A patch -- real, applicable code -- indicates genuine forethought and effort and I will reward that kind of conversation every time with equal effort.