I've been working on the next version of Swish-e (codename: Swish3) for about a year now. Squirreling away hours in the late evening after the kid is asleep, with one ear on the TV my wife is watching in the other room, and eyes on the screen in here. I've been learning C, UTF-8 and Perl's XS glue language. It's been a very stretching year.
This little corner of the blog will record my progress.
It's been a long week, culminating today in Frozen Perl 2010, a Perl conference for and by Perl hackers, here in the Twin Cities. I gave two talks at today's conference, one on Swish3 and the other on Devel::NYTProf and Search::Tools. Both talks seemed well-received.
In the process of preparing the talks I also released a few new, related modules to CPAN this week:
- Search::OpenSearch
OpenSearch server glue for KinoSearch and Swish-e 2.x via SWISH::Prog. There's a demo with a Plack app and ExtJS, using both search engines, as part of the slides for my Swish3 talk.
I think OpenSearch is very cool and look forward to doing more with that spec, including adding more features (e.g. facets) to Search::OpenSearch.
- Search::Query
Search::Query now has support for SQL and SWISH Dialects. I hope to add KinoSearch and Xapian dialects soon. The Search::Query::Parser now has (undocumented and experimental) support for range queries, so that you can say:
foo=( 1..4 )
and that'll be expanded to
foo=( 1 OR 2 OR 3 OR 4 )
when the Dialect query object is stringified. Handy for things like ranges of dates, which is how I am using it at $work.
- Search::Tools, SWISH::API::*
New releases of these older modules as well, with some bug fixes and refactoring to support Search::Query.
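To make the range-query expansion concrete, here is a rough sketch in C. It is not how Search::Query::Parser does it (that is Perl, and I have not copied its logic); the function name expand_range and its signature are invented for illustration, and it only handles ascending integer ranges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Search::Query): write
 * "field=( lo OR ... OR hi )" into out.
 * Returns the number of characters written, or -1 if the range is
 * descending or the buffer is too small. */
int expand_range(const char *field, int lo, int hi, char *out, size_t outlen)
{
    size_t used;
    int n, i;

    assert(field != NULL && out != NULL);
    if (lo > hi)
        return -1;

    n = snprintf(out, outlen, "%s=( ", field);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    used = (size_t)n;

    for (i = lo; i <= hi; i++) {
        /* every value after the first is joined with " OR " */
        n = snprintf(out + used, outlen - used, "%s%d",
                     (i == lo) ? "" : " OR ", i);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, outlen - used, " )");
    if (n < 0 || (size_t)n >= outlen - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

Calling expand_range("foo", 1, 4, buf, sizeof buf) fills buf with the expanded form shown in the Search::Query item above.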
I enjoyed hearing other folks' talks today at Frozen Perl. There was a good variety: pack/unpack, Unicode, i18n and best practice-related presentations. I met some new people, renewed friendships with folks I already knew, and drank lots of free coffee. The cookies were good too.
libswish3 - Swish3 C library
struct swish_3
{
int ref_cnt;
void *stash;
swish_Config *config;
swish_Analyzer *analyzer;
swish_Parser *parser;
};
struct swish_StringList
{
unsigned int n;
unsigned int max;
xmlChar** word;
};
struct swish_Config
{
int ref_cnt;
void *stash; /* for bindings */
xmlHashTablePtr misc;
xmlHashTablePtr properties;
xmlHashTablePtr metanames;
xmlHashTablePtr tag_aliases;
xmlHashTablePtr parsers;
xmlHashTablePtr mimes;
xmlHashTablePtr index;
xmlHashTablePtr stringlists;
struct swish_ConfigFlags *flags; /* shortcuts for parsing */
};
struct swish_ConfigFlags
{
boolean tokenize;
boolean cascade_meta_context;
xmlHashTablePtr meta_ids;
xmlHashTablePtr prop_ids;
//xmlHashTablePtr contexts;
};
struct swish_NamedBuffer
{
int ref_cnt; /* for bindings */
void *stash; /* for bindings */
xmlHashTablePtr hash; /* the meat */
};
struct swish_DocInfo
{
time_t mtime;
off_t size;
xmlChar * mime;
xmlChar * encoding;
xmlChar * uri;
unsigned int nwords;
xmlChar * ext;
xmlChar * parser;
xmlChar * action;
boolean is_gzipped;
int ref_cnt;
};
struct swish_MetaName
{
int ref_cnt;
int id;
xmlChar *name;
int bias;
xmlChar *alias_for;
};
struct swish_Property
{
int ref_cnt;
int id;
xmlChar *name;
boolean ignore_case;
int type;
boolean verbatim;
xmlChar *alias_for;
unsigned int max;
boolean sort;
boolean presort;
unsigned int sort_length;
};
struct swish_Token
{
unsigned int pos; // this token's position in document
swish_MetaName *meta;
xmlChar *value;
xmlChar *context;
unsigned int offset;
unsigned int len;
int ref_cnt;
};
struct swish_TokenList
{
unsigned int n;
unsigned int pos; // track position in document
xmlHashTablePtr contexts; // cache contexts
xmlBufferPtr buf;
swish_Token** tokens;
int ref_cnt;
};
struct swish_TokenIterator
{
swish_TokenList *tl;
swish_Analyzer *a;
unsigned int pos; // position in iteration
int ref_cnt;
};
struct swish_Tag
{
xmlChar *raw; // tag as libxml2 sees it
xmlChar *baked; // tag as libswish3 sees it
xmlChar *context;
struct swish_Tag *next;
unsigned int n;
};
struct swish_TagStack
{
swish_Tag *head;
swish_Tag *temp;
unsigned int count;
char *name; // debugging aid -- name of the stack
};
struct swish_Analyzer
{
unsigned int maxwordlen; // max word length
unsigned int minwordlen; // min word length
boolean tokenize; // should we parse into TokenList
int (*tokenizer) (swish_TokenIterator*, xmlChar*, swish_MetaName*, xmlChar*);
xmlChar* (*stemmer) (xmlChar*);
boolean lc; // should tokens be lowercased
void *stash; // for script bindings
void *regex; // optional regex
int ref_cnt; // for script bindings
};
struct swish_Parser
{
int ref_cnt; // for script bindings
void (*handler)(swish_ParserData*); // handler reference
void *stash; // for script bindings
int verbosity;
};
struct swish_ParserData
{
swish_3 *s3; // main object
xmlBufferPtr meta_buf; // tmp MetaName buffer
xmlBufferPtr prop_buf; // tmp Property buffer
xmlChar *tag; // current tag name
swish_DocInfo *docinfo; // document-specific properties
boolean no_index; // toggle flag. should buffer be indexed.
boolean is_html; // shortcut flag for html parser
boolean bump_word; // boolean for moving word position/adding space
unsigned int offset; // current offset position
swish_TagStack *metastack; // stacks for tracking the tag => metaname
swish_TagStack *propstack; // stacks for tracking the tag => property
swish_TagStack *domstack; // stacks for tracking xml/html dom tree
xmlParserCtxtPtr ctxt; // so we can free at end
swish_TokenIterator *token_iterator; // token container
swish_NamedBuffer *properties; // buffer all properties
swish_NamedBuffer *metanames; // buffer all metanames
};
void swish_setup();
const char * swish_lib_version();
const char * swish_libxml2_version();

swish_3 * swish_3_init( void (*handler) (swish_ParserData *), void *stash );
void swish_3_free( swish_3 *s3 );
int swish_parse_file( swish_3 * s3, xmlChar *filename);
unsigned int swish_parse_fh( swish_3 * s3, FILE * fh);
int swish_parse_buffer( swish_3 * s3, xmlChar * buf);
unsigned int swish_parse_directory( swish_3 *s3, xmlChar *dir, boolean follow_symlinks );

xmlChar * swish_io_slurp_fh( FILE * fh, unsigned long flen, boolean binmode );
xmlChar * swish_io_slurp_file_len( xmlChar *filename, off_t flen, boolean binmode );
xmlChar * swish_io_slurp_gzfile_len( xmlChar *filename, off_t flen, boolean binmode );
xmlChar * swish_io_slurp_file( xmlChar *filename, off_t flen, boolean is_gzipped, boolean binmode );
long int swish_io_count_operable_file_lines( xmlChar *filename );
boolean swish_io_is_skippable_line( xmlChar *str );

boolean swish_fs_file_exists( xmlChar *filename );
boolean swish_fs_is_dir( xmlChar *path );
boolean swish_fs_is_file( xmlChar *path );
boolean swish_fs_is_link( xmlChar *path );
off_t swish_fs_get_file_size( xmlChar *path );
time_t swish_fs_get_file_mtime( xmlChar *path );
xmlChar * swish_fs_get_file_ext( xmlChar *url );
boolean swish_fs_looks_like_gz( xmlChar *file );

int swish_hash_add( xmlHashTablePtr hash, xmlChar *key, void * value );
int swish_hash_replace( xmlHashTablePtr hash, xmlChar *key, void *value );
int swish_hash_delete( xmlHashTablePtr hash, xmlChar *key );
boolean swish_hash_exists( xmlHashTablePtr hash, xmlChar *key );
int swish_hash_exists_or_add( xmlHashTablePtr hash, xmlChar *key, xmlChar *value );
void swish_hash_merge( xmlHashTablePtr hash1, xmlHashTablePtr hash2 );
void * swish_hash_fetch( xmlHashTablePtr hash, xmlChar *key );
void swish_hash_dump( xmlHashTablePtr hash, const char *label );
xmlHashTablePtr swish_hash_init(int size);
void swish_hash_free( xmlHashTablePtr hash );

void swish_mem_init();
void * swish_xrealloc(void *ptr, size_t size);
void * swish_xmalloc( size_t size );
void swish_xfree( void *ptr );
void swish_mem_debug();
long int swish_memcount_get();
void swish_memcount_dec();
xmlChar * swish_xstrdup( const xmlChar * ptr );
xmlChar * swish_xstrndup( const xmlChar * ptr, int len );

double swish_time_elapsed(void);
double swish_time_cpu(void);
char * swish_time_print(double time);
char * swish_time_print_fine(double time);
char * swish_time_format(time_t epoch);

void swish_set_error_handle( FILE *where );
void swish_croak(const char *file, int line, const char *func, const char *msg,...);
void swish_warn(const char *file, int line, const char *func, const char *msg,...);
void swish_debug(const char *file, int line, const char *func, const char *msg,...);

char * swish_get_locale();
void swish_verify_utf8_locale();
boolean swish_is_ascii( xmlChar *str );
int swish_bytes_in_wchar( int wchar );
int swish_utf8_chr_len( xmlChar *utf8 );
uint32_t swish_utf8_codepoint( xmlChar *utf8 );
int swish_utf8_num_chrs( xmlChar *utf8 );
void swish_utf8_next_chr( xmlChar *s, int *i );
void swish_utf8_prev_chr( xmlChar *s, int *i );
xmlChar * swish_str_escape_utf8( xmlChar *utf8 );
xmlChar * swish_str_unescape_utf8( xmlChar *ascii );
wchar_t * swish_locale_to_wchar(xmlChar * str);
xmlChar * swish_wchar_to_locale(wchar_t * str);
wchar_t * swish_wstr_tolower(wchar_t *s);
xmlChar * swish_str_tolower(xmlChar *s );
xmlChar * swish_utf8_str_tolower(xmlChar *s);
xmlChar * swish_ascii_str_tolower(xmlChar *s);
xmlChar * swish_str_skip_ws(xmlChar *s);
void swish_str_trim_ws(xmlChar *string);
void swish_str_ctrl_to_ws(xmlChar *s);
boolean swish_str_all_ws(xmlChar * s);
boolean swish_str_all_ws_len(xmlChar * s, int len);
void swish_debug_wchars( const wchar_t * widechars );
int swish_wchar_t_comp(const void *s1, const void *s2);
int swish_sort_wchar(wchar_t *s);
swish_StringList * swish_stringlist_build(xmlChar *line);
swish_StringList * swish_stringlist_init();
void swish_stringlist_free(swish_StringList *sl);
unsigned int swish_stringlist_add_string(swish_StringList *sl, xmlChar *str);
void swish_stringlist_merge(swish_StringList *sl1, swish_StringList *sl2);
swish_StringList * swish_stringlist_copy(swish_StringList *sl);
swish_StringList * swish_stringlist_parse_sort_string(xmlChar *sort_string, swish_Config *cfg);
void swish_stringlist_debug(swish_StringList *sl);
int swish_string_to_int( char *buf );
boolean swish_string_to_boolean( char *buf );
xmlChar * swish_int_to_string( int val );
xmlChar * swish_long_to_string( long val );
xmlChar * swish_double_to_string( double val );
xmlChar * swish_date_to_string( int y, int m, int d );
char swish_get_C_escaped_char(xmlChar *s, xmlChar **se);

swish_Config * swish_config_init();
void swish_config_set_default( swish_Config *config );
void swish_config_merge( swish_Config *config1, swish_Config *config2 );
swish_Config * swish_config_add( swish_Config * config, xmlChar * conf );
swish_Config * swish_config_parse( swish_Config * config, xmlChar * conf );
void swish_config_debug( swish_Config * config );
void swish_config_free( swish_Config * config);
xmlHashTablePtr swish_mime_defaults();
xmlChar * swish_mime_get_type( swish_Config * config, xmlChar * fileext );
xmlChar * swish_mime_get_parser( swish_Config * config, xmlChar *mime );
swish_ConfigFlags * swish_config_flags_init();
void swish_config_flags_free( swish_ConfigFlags *flags );
void swish_config_test_alias_fors( swish_Config *config );
void swish_config_test_unique_ids( swish_Config *config );

swish_Parser * swish_parser_init( void (*handler) (swish_ParserData *) );
void swish_parser_free( swish_Parser * parser );
swish_TokenList * swish_token_list_init();
void swish_token_list_free( swish_TokenList *tl );
int swish_token_list_add_token(
swish_TokenList *tl,
xmlChar *token,
int token_len,
swish_MetaName *meta,
xmlChar *context );
int swish_token_list_set_token(
swish_TokenList *tl,
xmlChar *token,
int len );
swish_Token * swish_token_init();
void swish_token_free( swish_Token *t );
swish_TokenIterator *swish_token_iterator_init( swish_Analyzer *a );
void swish_token_iterator_free( swish_TokenIterator *ti );
swish_Token * swish_token_iterator_next_token( swish_TokenIterator *it );
int swish_tokenize( swish_TokenIterator *ti,
xmlChar *buf,
swish_MetaName *meta,
xmlChar *context );
int swish_tokenize_ascii(
swish_TokenIterator *ti,
xmlChar *buf,
swish_MetaName *meta,
xmlChar *context );
int swish_tokenize_utf8(
swish_TokenIterator *ti,
xmlChar *buf,
swish_MetaName *meta,
xmlChar *context );
void swish_token_list_debug( swish_TokenIterator *it );
xmlChar * swish_token_list_get_token_value( swish_TokenList *tl, swish_Token *t );
void swish_token_debug( swish_Token *t );
swish_Analyzer * swish_analyzer_init( swish_Config * config );
void swish_analyzer_free( swish_Analyzer * analyzer );
swish_DocInfo * swish_docinfo_init();
void swish_docinfo_free( swish_DocInfo * ptr );
int swish_docinfo_check(swish_DocInfo * docinfo, swish_Config * config);
int swish_docinfo_from_filesystem( xmlChar *filename,
swish_DocInfo * i,
swish_ParserData *parser_data );
void swish_docinfo_debug( swish_DocInfo * docinfo );
swish_NamedBuffer * swish_nb_init( xmlHashTablePtr confhash );
void swish_nb_free( swish_NamedBuffer * nb );
void swish_nb_debug( swish_NamedBuffer * nb, xmlChar * label );
void swish_nb_add_buf( swish_NamedBuffer *nb,
xmlChar * name,
xmlBufferPtr buf,
xmlChar * joiner,
boolean cleanwsp,
boolean autovivify);
void swish_nb_add_str( swish_NamedBuffer * nb,
xmlChar * name,
xmlChar * str,
unsigned int len,
xmlChar * joiner,
boolean cleanwsp,
boolean autovivify);
void swish_buffer_append( xmlBufferPtr buf, xmlChar * txt, int len );
xmlChar* swish_nb_get_value( swish_NamedBuffer* nb, xmlChar* key );
swish_Property * swish_property_init( xmlChar *propname );
void swish_property_free( swish_Property *prop );
void swish_property_debug( swish_Property *prop );
int swish_property_get_id( xmlChar *propname, xmlHashTablePtr properties );
swish_MetaName * swish_metaname_init( xmlChar *name);
void swish_metaname_free( swish_MetaName *m );
void swish_metaname_debug( swish_MetaName *m );
boolean swish_header_validate(char *filename);
boolean swish_header_merge(char *filename, swish_Config *c);
swish_Config * swish_header_read(char *filename);
void swish_header_write(char* filename, swish_Config* config);
libswish3 is the core C library of Swish3.
libswish3 uses the GNOME Libxml2 library to parse words and metadata from XML, HTML and plain text files. libswish3 supports full UTF-8 encoding.
libswish3 is a parsing tool for use with information retrieval (IR) libraries. Dynamic language bindings are available in the source distribution in the bindings directory.
The following APIs are defined:
libswish3 provides three basic input functions:
- swish_parse_file()
- swish_parse_fh()
- swish_parse_buffer()
Each of these functions takes a swish_3 struct pointer as its first argument, as shown in the SYNOPSIS. Optional user data travels in the stash pointer passed to swish_3_init().
In addition:
- The swish_parse_file() function takes a file path, which must be a valid file. Directories and links are not supported. The assumption is that you will use your calling code to recurse through directories and handle links.
- swish_parse_buffer() takes a string representing the document headers and the full text of the document.
- swish_parse_fh() takes a filehandle pointer, which if set to NULL, defaults to stdin.
See the Headers API section for more information on using swish_parse_fh() and swish_parse_buffer().
See the handler Function section for more information on how to deal with the data extracted by each of the swish_parse_* functions.
The Headers API supports and extends the Swish-e -S prog feature, which allows you to feed the indexer with output from another program. The API has been extended from Swish-e's to allow for MIME types and more congruence with the HTTP 1.1 specification.
See SWISH-RUN documentation in the Swish-e distribution for the Swish-e version 2 headers API.
This is the libswish3 implementation. See SWISH::Prog::Headers for a simple Perl-based way of generating the proper headers.
- Content-Location
Swish-e name: Path-Name
The name of the document. May be any string: an ID of a record in a database, a URL or a simple file name. The string is stored in the swish_DocInfo uri struct member, which is often used as the primary identifier of a document in an index.
This header is required.
- Content-Length
The length in bytes of the document, starting after the blank line separating the headers from the document itself. The value must be exactly the length of the document, including any extra line feeds or carriage returns at the end of the document.
Example:
Content-Location: foo.html
Content-Length: 9

The doc.\n
^^^^^^^^ ^
12345678 9
The value is stored in the swish_DocInfo size struct member.
This header is required.
- Last-Modified
Swish-e name: Last-Mtime
The last modification time of the document. The value must be an integer: the seconds since the Epoch on your system.
If not present, the value defaults to the current time.
The value is stored in the swish_DocInfo mtime struct member.
This header is not required.
- Parser-Type
Swish-e name: Document-Type
Explicitly name the parser used for the document, rather than defaulting to the MIME type mapping based on Content-Type and/or Content-Location. The three parser types are:
  - XML
  - HTML
  - TXT
The Swish-e values XML2, XML*, HTML2, HTML*, TXT2, TXT* are also supported for compatibility, but they map to the three libswish3 types.
The value is stored in the swish_DocInfo parser struct member.
If not present, the document parser will be automatically chosen based on the following logic:
  - If a Content-Type is given, the parser mapped to that MIME type will be used. You may override the default mappings in your configuration. See Configuration API.
  - If no Content-Type is given, a MIME type will be guessed at based on the file extension of the document's Content-Location, and the parser mapped to that MIME type will be used.
  - Finally, if a MIME type is not identified, the parser defined in SWISH_DEFAULT_PARSER in libswish3.h will be used.
See also Content-Type and Content-Location.
This header is not required.
- Content-Type
The MIME type of the document. The libswish3 MIME type list is based on the Apache 2.0 version. See http://www.iana.org/assignments/media-types/ for the official registry.
If not defined with Content-Type, the MIME type will be guessed based on the file extension in the Content-Location header. If the Content-Location string does not contain a file extension (as might be the case with a non-URL value), or the file extension has no MIME mapping, then the MIME type will default to SWISH_DEFAULT_MIME as defined in libswish3.h.
You may override the default extension-to-MIME mappings in your configuration. See Configuration API.
The value is stored in the swish_DocInfo mime struct member.
See also Content-Location and Parser-Type.
This header is not required.
- Update-Mode
NOTE: This header exists only for backwards compatibility with Swish-e's incremental index feature. It may be deprecated in a future version of libswish3.
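As a rough illustration of the header format described above, the following sketch formats one document (headers, a blank line, then the body) into a buffer that could be fed to swish_parse_buffer() or piped to swish_parse_fh(). The function build_swish_doc is hypothetical and not part of libswish3. It assumes plain "\n" line endings and a NUL-terminated body, so it will not work for documents containing embedded NUL bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of libswish3: write the two required
 * headers, an optional Content-Type, a blank line and the body into out.
 * Returns the total number of bytes written, or -1 if out is too small. */
int build_swish_doc(const char *uri, const char *mime, const char *body,
                    char *out, size_t outlen)
{
    size_t body_len;
    int n;

    assert(uri != NULL && body != NULL && out != NULL);

    /* Content-Length counts every byte of the body, including any
     * trailing newline. */
    body_len = strlen(body);

    if (mime != NULL) {
        n = snprintf(out, outlen,
                     "Content-Location: %s\n"
                     "Content-Length: %lu\n"
                     "Content-Type: %s\n"
                     "\n%s",
                     uri, (unsigned long)body_len, mime, body);
    } else {
        n = snprintf(out, outlen,
                     "Content-Location: %s\n"
                     "Content-Length: %lu\n"
                     "\n%s",
                     uri, (unsigned long)body_len, body);
    }

    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return n;
}
```

With uri "foo.html", no MIME type and the body "The doc.\n", the result carries a Content-Length of 9, matching the example under Content-Length.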
Writing an effective handler function requires an understanding of some of the key libswish3 data structures.
For more details on any of these structures, see the SYNOPSIS.
swish_3 is the main data structure. A swish_3 object has a swish_Config, swish_Analyzer and swish_Parser object and delegates to each as appropriate.
This is typically the only object you need to create and use.
swish_Config is a configuration object. This object is required for initializing both a swish_Analyzer object and a swish_Parser object.
swish_Parser is a parser object. Required for executing any of the three swish_parse_* functions.
swish_ParserData is a parser data object. This object is passed around internally by the libxml2 SAX2 handlers, and is eventually the object passed to the handler function pointer. See The handler Function.
swish_TokenList is a list of words or tokens. The object is typically accessed via a swish_TokenIterator, like this:
// example of swish_TokenIterator
swish_Token *t;
while ((t = swish_token_iterator_next_token(token_iterator)) != NULL) {
    SWISH_DEBUG_MSG("\n"
        "t->ref_cnt = %d\n"
        "t->pos = %u\n"
        "t->context = %s\n"
        "t->meta = %d [%s]\n"
        "t->len = %u\n"
        "t->value = %s\n",
        t->ref_cnt, t->pos, t->context, t->meta->id, t->meta->name, t->len, t->value);
}
See the swish_token_list_debug() function, from which the code above is taken.
swish_Token is an object representing one word or token. The word's position relative to other words, length, tag context and MetaName are all available in the object.
swish_DocInfo is an object describing metadata about the document itself: URI, MIME type, size, etc.
The swish_Analyzer object controls how the character content of a document is parsed: whether or not a TokenList is created with a tokenizer, if the words (tokens) are lowercased or stemmed, etc.
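To show what the analyzer settings (minimum and maximum word length, lowercasing) mean in practice, here is a minimal ASCII-only sketch. It is not the libswish3 tokenizer: the sketch_* types and function are stand-ins invented for this example, words are split on any non-alphanumeric byte, and skipped words do not consume a position, which is my assumption rather than documented behaviour.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_WORD 64      /* arbitrary cap for this sketch */

/* Stand-in for the analyzer settings; not the real swish_Analyzer. */
typedef struct {
    unsigned int minwordlen;
    unsigned int maxwordlen;
    int lc;                     /* nonzero: lowercase each token */
} sketch_analyzer;

/* Stand-in for a token; not the real swish_Token. */
typedef struct {
    char value[SKETCH_MAX_WORD + 1];
    unsigned int pos;           /* 1-based position among kept words */
} sketch_token;

/* Split ASCII text into words, honouring the analyzer settings.
 * Returns the number of tokens stored in out (at most max_out). */
size_t sketch_tokenize(const sketch_analyzer *a, const char *text,
                       sketch_token *out, size_t max_out)
{
    size_t count = 0;
    unsigned int pos = 0;
    const char *p = text;

    assert(a != NULL && text != NULL && out != NULL);

    while (*p != '\0' && count < max_out) {
        const char *start;
        size_t len, i;

        /* skip separators, then scan one word */
        while (*p != '\0' && !isalnum((unsigned char)*p))
            p++;
        start = p;
        while (*p != '\0' && isalnum((unsigned char)*p))
            p++;
        len = (size_t)(p - start);

        if (len == 0 || len < a->minwordlen || len > a->maxwordlen
            || len > SKETCH_MAX_WORD)
            continue;           /* too short, too long, or end of input */

        for (i = 0; i < len; i++) {
            unsigned char c = (unsigned char)start[i];
            out[count].value[i] = (char)(a->lc ? tolower(c) : c);
        }
        out[count].value[len] = '\0';
        out[count].pos = ++pos;
        count++;
    }
    return count;
}
```

A real UTF-8 tokenizer has to walk multi-byte characters instead of single bytes, which is why libswish3 ships separate swish_tokenize_ascii() and swish_tokenize_utf8() functions.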
The handler function pointer is the final link in the parsing chain. The function pointer is set in the swish_Parser object constructor, and is called by each of the swish_parse_* functions after the entire document has been parsed and (optionally) tokenized.
The handler receives one argument: a swish_ParserData object containing all the metadata and words in the document.
If all you wanted to do was print out a report about each document as it was parsed, your handler function might be as simple as:
void
my_handler( swish_ParserData * parse_data )
{
    swish_docinfo_debug( parse_data->docinfo );
    swish_token_list_debug( parse_data->token_iterator );
    swish_nb_debug( parse_data->properties, "Property" );
    swish_nb_debug( parse_data->metanames, "MetaName" );
}
IMPORTANT: After the handler function is called, all the structures referenced by the swish_ParserData object are automatically freed, so if you intend to keep any of the data for storing in an index, you will need to strdup() words, properties, docinfo, etc. as part of your indexing code.
See the example swish_lint.c file for how to create and pass in a handler function pointer to the swish_3_init() constructor.
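The warning about freed structures can be demonstrated without libswish3 at all. The sketch below uses stand-in types (sketch_data, sketch_index_entry) and a fake sketch_parse() that mimics the lifecycle: build the data, call the handler, then free everything. None of these names exist in libswish3. The point is only that the handler must copy what it wants to keep rather than hold on to the pointer it was given.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the data handed to a handler; not the real swish_ParserData. */
typedef struct {
    char *uri;      /* owned by the parser, freed after the handler returns */
    void *stash;    /* caller's own storage */
} sketch_data;

/* Caller-owned record that must outlive the parse. */
typedef struct {
    char *uri_copy;
} sketch_index_entry;

/* Portable equivalent of strdup(). Returns NULL if allocation fails. */
char *sketch_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* The handler copies the string instead of saving the pointer. */
void sketch_handler(sketch_data *d)
{
    sketch_index_entry *entry = d->stash;
    entry->uri_copy = sketch_strdup(d->uri);
}

/* Mimics the parse lifecycle: build data, call handler, free everything. */
void sketch_parse(const char *uri, void (*handler)(sketch_data *), void *stash)
{
    sketch_data d;

    d.uri = sketch_strdup(uri);
    d.stash = stash;
    assert(d.uri != NULL);

    handler(&d);

    /* scribble over the buffer before freeing it, so that any pointer
     * kept by a careless handler would visibly hold garbage */
    memset(d.uri, 0, strlen(d.uri));
    free(d.uri);
}
```

After sketch_parse() returns, the entry's uri_copy is still valid and belongs to the caller, who is responsible for freeing it.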
Configuration is different with libswish3 than with Swish-e. The biggest change is that libswish3 configuration files are written in XML. This is done for several reasons:
- 1
Since libswish3 already has a powerful XML parser built-in, it's much easier to parse a configuration file written in XML than to port the Swish-e config parser to libswish3.
- 2
libswish3 stores index header information in an XML format nearly identical to the configuration file format. So the parser needs to understand only one XML schema.
- 3
You can store UTF-8 text in your configuration file and it will be parsed correctly.
- 4
The configuration directive list is extensible. Simple key/value configuration directives can be added without any modification to the libswish3 config parser. They are simply stored in the swish_Config struct hash for your own use and amusement.
CAUTION: The configuration directive names documented in the Directives section below are reserved for use by libswish3. Some of them have special handling considerations (like MetaNames and PropertyNames). So the important idea to grasp with the extensible configuration feature is "simple key/value pairs."
This section describes how to build a libswish3 configuration file.
Here's an example libswish3 configuration file:
<swish>
  <FollowSymLinks>yes</FollowSymLinks>
  <MetaNames>
    <foo bias="+10" />
    <bar bias="-5" />
    <swishtitle bias="+50" />
    <title alias_for="swishtitle" />
    <other>color size weight</other>
  </MetaNames>
  <PropertyNames>
    <foo type="text" ignore_case="1" />
    <bar type="int" />
    <lastmod type="date" />
    <bing ignore_case="0" />
    <description verbatim="1" max="10000" alias="body" sort_length="20" />
    <notsorted sort="0" />
  </PropertyNames>
  <Tokenize>1</Tokenize>
</swish>
And here's that same example, dissected:
<swish>
The top level tag.
<FollowSymLinks>yes</FollowSymLinks>
Equivalent to the Swish-e style:
FollowSymLinks yes
which simply informs whatever aggregator you are using that when confronted with a symlink on the filesystem, it should be followed.
FollowSymLinks is an example of a simple key/value pair (see the CAUTION above).
Here's the first big difference from Swish-e. MetaNames, MetaNameAlias, and MetaNamesRank have been combined into a single XML tag with appropriate attributes.
<foo bias="10" />
is the same thing as (in Swish-e style):
MetaNames foo
MetaNamesRank 10 foo
while:
<swishtitle bias="50" />
<title alias_for="swishtitle" />
is equivalent to:
MetaNames swishtitle
MetaNameAlias swishtitle title
MetaNamesRank 50 swishtitle
You can see that the XML style allows for a terser, more compact expression. You can still assign multiple aliases to a single MetaName:
<other>color size weight</other>
is equivalent to:
MetaNames other
MetaNameAlias other color size weight
In addition, there are some special features intended for use with HTML documents.
<links html="1" alias_for="href" /> # same as HTMLLinksMetaName
<images html="1" alias_for="src" /> # same as ImageLinksMetaName
<alttext html="1" alias_for="alt" /> # same as IndexAltTagMetaName
PropertyNames, PropertyNamesCompareCase, PropertyNamesIgnoreCase, PropertyNamesNoStripChars, PropertyNamesNumeric, PropertyNamesDate, PropertyNameAlias, PropertyNamesMaxLength, PropertyNamesSortKeyLength, StoreDescription and PreSortedIndex have all been combined into a single XML directive.
Here's the example from above with equivalent Swish-e directives annotated:
<foo ignore_case="1" /> # PropertyNamesIgnoreCase foo
<bar type="int" /> # PropertyNamesNumeric bar
<lastmod type="date" /> # PropertyNamesDate lastmod
<bing comparecase="1" /> # PropertyNamesCompareCase bing
<description verbatim="1" max="10000" alias="body" sort_length="20" />
# PropertyNamesNoStripChars description
# PropertyNamesMaxLength 10000 description
# PropertyNameAlias description body
# PropertyNamesSortKeyLength 20 description
<notsorted sort="0" /> # PreSortedIndex foo bar lastmod bing description
Again, the XML format greatly simplifies the syntax. You can assign attributes as you need, though be aware that some attributes are inherently mismatched and might generate an error or unexpected behaviour:
<foo ignore_case="1" type="int" /> # wrong
<foo ignore_case="0" type="date" /> # wrong
<foo verbatim="1" type="int" /> # wrong
<foo sort="0" max="20" /> # wrong
The following configuration directives are currently reserved by libswish3.
These are top-level tags within the <swish> tag.
- swish_version
Contains the value of the SWISH_VERSION constant.
- swish_lib_version
Contains the value of the SWISH_LIB_VERSION constant.
- MetaNames
Contains MetaName definitions.
- PropertyNames
Contains PropertyName definitions.
- Parsers
Contains a mapping of Parser name to MIME types. Example:
<Parsers>
  <XML>application/xml</XML>
  <HTML>text/html</HTML>
  <TXT>text/plain</TXT>
  <XML>text/foo</XML>
  <HTML>default</HTML>
  <HTML>foo/bar</HTML>
</Parsers>
- MIME
Contains a mapping of file extensions to MIME types. Example:
<MIME>
  <au>foo/bar</au>
</MIME>
- Index
Contains attributes of the inverted index. Example (with defaults):
<Index>
  <Format>Native</Format>
  <Name>index.swish</Name>
  <Locale>en_US.UTF-8</Locale>
</Index>
- TagAlias
Contains mapping of alias names to tag names.
<TagAlias>
  <swishdescription>body</swishdescription>
  <swishtitle>title</swishtitle>
  <swishtitle>foo</swishtitle>
  <swishtitle>bar</swishtitle>
</TagAlias>
- Tokenize
Toggle the tokenizer on or off. Default is yes (on).
<Tokenize>yes</Tokenize>
- CascadeMetaContext
Toggle the cascading effect of MetaName context. The default is no (off).
<CascadeMetaContext>no</CascadeMetaContext>
The effect is to consider a Token assigned to every MetaName in its DOM stack. An example:
<doc>
  <metaone>
    foo
    <metatwo>bar</metatwo>
  </metaone>
</doc>
If CascadeMetaContext is true (on), then the token bar will be tracked as both MetaName metaone and metatwo. If CascadeMetaContext is false (off), then the token bar will be tracked only as metatwo.
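The CascadeMetaContext rule can be sketched as a small function over the stack of enclosing MetaNames. This is an illustration only, not the libswish3 implementation; sketch_token_metanames and its arguments are hypothetical, and the stack is modelled as a plain array with the outermost MetaName first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libswish3.
 * stack[0] is the outermost MetaName, stack[depth - 1] the innermost.
 * Fills out[] with the MetaNames a token is tracked under and returns
 * how many were stored (at most max_out). */
size_t sketch_token_metanames(const char *const *stack, size_t depth,
                              int cascade, const char **out, size_t max_out)
{
    size_t i, n = 0;

    assert(stack != NULL && out != NULL);
    if (depth == 0)
        return 0;

    if (!cascade) {
        /* off: only the innermost MetaName applies */
        if (max_out > 0)
            out[n++] = stack[depth - 1];
        return n;
    }

    /* on: every MetaName in the stack applies */
    for (i = 0; i < depth && n < max_out; i++)
        out[n++] = stack[i];
    return n;
}
```

For the example document, the token bar sits under the stack { "metaone", "metatwo" }: with cascade on the function reports both names, with cascade off only metatwo.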
See the swish_lint.c file included in the libswish3 distribution.
Information Retrieval.
libswish3 is the core parsing library for Swish-e version 3 (Swish3).
No. libswish3 is a document parser. It might work well in or with any number of search engines, but it is not in itself any kind of search tool.
libswish3 reads text, HTML and XML files and extracts just the words and document properties from each document. It then hands off the token list and properties to a handler function. Finally, it frees all the memory associated with the token list and properties.
The handler function can do whatever you wish, though typically a handler would iterate over the words in the token list and add each one to an index using an IR library API.
libswish3 is part of the Swish-e project. It was born out of the need for UTF-8 and incremental indexing support and a desire to experiment with alternate indexing libraries like Lucene, KinoSearch, Xapian and Hyperestraier.
libswish3 was developed with the idea that many quality IR libraries already exist, but few if any provide an easy and fast way of preparing documents for indexing. The following assumptions informed the development of libswish3.
A decent IR toolchain requires 5 parts:
- aggregator
Collects documents from a filesystem, database, website or other sources.
- filter
Normalizes documents to a standard format (plain text or a delimited/markup like YAML, HTML or XML) for indexing.
- parser
Breaks a document into a list of words, including their context and position.
- indexer
Writes the list of words in a storage system for quick, efficient retrieval.
- searcher
Parses queries and fetches data from the indexer's storage system.
Of course, the division between these parts is not always clean or apparent. Parsing search queries, for example, will necessarily involve elements of the parser and searcher components, while the indexer and searcher are of necessity intrinsically bound.
But any complete IR system will have these five parts in some combination.
The existing Swish-e document aggregators ( DirTree.pl and spider.pl ) and filtering system ( SWISH::Filter ) are good. They are all written in Perl and are easily modified, and they have ample configuration options and documentation.
Several good IR libraries exist that provide an indexer and searcher. These libraries do UTF-8, incremental indexing, and have search syntax on par with (or better than) Swish-e 2.x. Examples include Xapian, KinoSearch and Lucene. While they might be a little slower than Swish-e (at least in terms of indexing speed) they make up for that with:
- well-documented APIs
- bindings in a variety of programming languages
- active development communities
- the flexibility that comes with being a library instead of a fixed program
The piece that Swish-e provides that other IR libraries lack is a fast, stable, integrated document parser. Xapian has Omega, but it does not parse XML, nor does it recognize ad hoc word context (metanames).
However, the Swish-e 2.x parser does not work independently of the Swish-e indexer and searcher, nor does it support UTF-8.
One piece is missing: a parser that works with the Swish-e aggregator/filter system, supports UTF-8, and offers flexible options for connecting with other IR libraries.
Ergo, libswish3: a document parser compatible with the existing Swish-e -S prog API and capable of generating UTF-8 token lists for indexing with a variety of IR libraries.
libswish3 is the core C library in Swish3.
However, libswish3 may be used without the rest of Swish3. The assumption is that libswish3 could fit into an IR toolchain like this:
aggregator -> filter -> libswish3 -> some IR library
You could then use the native search API of the IR library.
For example, you might use the Swish-e spider.pl script to spider a website, filtering documents with SWISH::Filter and then handing the output to a libswish3-based program that will parse the documents into words and store the data in a Xapian or KinoSearch index (or both!). That model is, in fact, what Swish3 does.
Or you might use the SWISH::Prog Perl module (from the CPAN) to build your own aggregator/filter system, then hand the output to libswish3.
Peter Karman (peter@peknet.com).
libswish3 is inspired by code from Swish-e (http://www.swish-e.org), Libxml2 (http://www.xmlsoft.org), Apache (http://www.apache.org), Rahul Dhesi (http://www.tug.org/tex-archive/tools/zoo/), Angel Ortega (http://www.triptico.com/software/unicode.html), James Henstridge (http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html), YoLinux (http://www.yolinux.com/TUTORIALS/GnomeLibXml2.html) and no doubt many unnamed others.
All mistakes, errors and poor programming choices are, however, those of the author.
libswish3 is licensed under the GPL.
libswish3 is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
libswish3 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.
You should have received a copy of the GNU Library General Public License along with libswish3; see the file COPYING. If not, write to the
Free Software Foundation, Inc. 59 Temple Place - Suite 330 Boston, MA 02111-1307, USA
The project homepage: http://dev.swish-e.org/wiki/swish3
swish_lint(1), swish_isw(1), swish_words(1)
Uploaded first pass at both implementations this last week. The announcement to the Swish-e list just went out.
After 4 years of learning how to glue Perl and C together with XS and many sleepless nights, I have released SWISH::3 to the CPAN.
<cue the sound of scattered applause>
Mostly this is a triumph of longevity rather than quality code. It's taken me this long to get something workable.
The Xapian backend for Swish3 has been getting some love lately. The swish_xapian command line tool now has most of the features that swish-e v2.x does.
I've posted about it on the Swish-e wiki.
I just wasted many hours trying to figure out why libswish3 failed to pass all its tests on Mac OS X 10.6.
This link explains what I figured out to be true the hard way:
10.6 is now a mainstream 64bit OS !
10.5 a 64bit capable 32bit OS !
If I force a 32-bit compile, all is well:
CFLAGS="-m32 -O2 -g" ./configure && make test
While I would like to figure out how to compile as a native 64bit app, my MacBook has too many libs from before the 10.5 to 10.6 upgrade to trust that all the dep chain is 64bit compat.
The error I was seeing was the noxious BAD_ADDRESS error which traced back to some libxml2 hash features. Red herring. Of course, I had to recompile libxml2 with the -m32 as well so that everything was 32bit compatible. Took me hours before I noticed that the older working version on the same box was about half the size of the new version... which triggered the ol' 32-vs-64-bit thing in my brain.
Update: In the end this was a bug in libswish3 with confusing naming of some variables. But the 64-bit thing was a Good Thing To Realize.
Swish is the Simple Web Indexing System for Humans. Swish is an information retrieval tool. It is not a search engine, but can be used as an integral part of creating a search engine. Swish gathers, parses, indexes and searches document collections. A collection can be any set of real or virtual documents: web pages, database rows, PDFs or office files, or anything else that can be converted to text.
Swish3 is version three of Swish. Kevin Hughes wrote the original version in 1994. In 2000, the project was updated and released as Swish-e version 2 (the -e is for Enhanced). Swish3 is the third phase in the evolution of the project.
In this document, the name Swish will refer to the entire project, without regard to a particular version. Swish-e will refer specifically to version 2.x. Swish3 will refer specifically to version three.
The following description could apply to any search system or information retrieval project, not just Swish. First we'll look at the various parts of the system, then look at how they are implemented in Swish3.
Every search system implements the following chain of features:
- aggregator
An aggregator assembles documents into a collection. It can be as simple as a filesystem tool like the Unix find command or as sophisticated as a web crawler. An aggregator selects documents based on various criteria: content, MIME type or format, date, author, URL, or any other criteria that you desire.
- normalizer
A normalizer verifies that all documents the aggregator collects are in a format that the analyzer can parse. For example, a binary file format like PDF is converted to HTML or unmarked text. The same is true for all office file formats, PostScript, etc.
- analyzer
An analyzer examines the text supplied via the aggregator/normalizer steps. The analyzer does several things, some of them optional:
- parsing
Separates text from any surrounding markup, optionally remembering the context (tag) in which text was found.
- case folding
Changes the text to all lowercase or all uppercase, to make comparisons easier.
- tokenizing
Splitting a string of text into tokens or words.
- stemming
Using one of a variety of word-stemming algorithms, tries to discover the root stem of each word.
- customization
Many advanced analyzers offer some level of customization to apply at some point in the analysis, whether it be synonym matching or other linguistic logic.
- indexer
An indexer stores basic document metadata and token (word) information in an index for fast and efficient retrieval.
- searcher
A searcher parses a user query using the same logic used by the analyzer when processing the original document collection, applies some well-defined rules for matching documents in the index, and then returns results, typically a list or iterator of matching documents.
Now let's look at how Swish3 implements these five features.
The first thing to know about Swish3 is that, unlike previous versions of Swish, there is not a single Swish3 implementation.
That might sound confusing at first, because it is a significant departure from earlier versions of Swish, where there was a primary program, written in C, which handled all five links in the search chain. Swish3 takes a different approach.
Swish3 is primarily a C library called libswish3. The library has a well-defined list of public functions and data structures that aim to fill a particular void in the world of information retrieval tools: analyzing HTML and XML documents.
Swish3 takes as its starting point the -S prog feature of Swish-e, where you can define your own aggregator/normalizer program, and makes that Swish3's central feature. Swish3 extends the -S prog API to include additional header values, and adds the same MIME-type-matching feature as the popular Apache web server.
Swish3 has no native indexer or searcher features [TODO: this might change if the 2.6 BDB backend is ported]. Nor does it have any aggregator or normalizer features. Swish3 is primarily an analyzer.
The Swish3 distribution does come with some examples of how to write Swish3 applications, including an example program for using the popular Xapian library. And there is a Perl implementation based on the SWISH::Prog package.
libswish3 defines hooks or callbacks where you can override the default behaviour of the analyzer. These hooks are intended for making it easy to plug libswish3 into the indexing chain.
Here's one example. If you wanted to index a web site, you might use an aggregator/normalizer tool like Swish-e's spider.pl. spider.pl will print its output on stdout.
% spider.pl your_config > spider_output
Then you could use a program like swish_xapian to analyze and index the output:
% swish_xapian -c swish.conf - < spider_output
If you look at the source for the swish_xapian program, in the libswish3 distribution, you will see that there is a handler function defined that takes the output of the libswish3 parsing function and adds it to a Xapian index.
This document provides an overview of Swish3's anatomy. You might also be interested in these docs:
If you haven't already, read the Introduction to Swish3 document first.
This document is intended for users already familiar with Swish-e version 2.x who want to migrate to using Swish3.
Swish3 is intended to be one part of a search system tool chain. In this section we will look at how Swish-e implements each of the tool chain features, and then compare it to Swish3.
Swish-e has two built-in aggregators, for filesystem and web, indicated with the -S flag to the swish-e command. Swish-e also has a third -S option called prog, which is short for program. The program is an aggregator that you define. Swish-e ships with several example aggregators, including a filesystem crawler called DirTree.pl and a web crawler called spider.pl. There are also example aggregators for pulling data from a database and for specific kinds of documents, like Hypermail mail archives.
Swish3 has no built-in aggregators. Instead, Swish3 takes the -S prog approach of defining an API for external aggregators to follow.
Swish-e has a feature called FileFilter which allows you to define an external program to call if a document's name matches a particular pattern. The file is handed to the external program and the output of the external program is treated as the contents of the document. For example, you can specify that all documents that end with .pdf are first filtered through the pdftotext command.
Swish-e also comes with a set of Perl modules bundled together as SWISH::Filter. SWISH::Filter is used by the external aggregators like DirTree.pl and spider.pl, thus making those programs both aggregators and normalizers.
Swish3 has no built-in normalizer or feature like FileFilter. Instead, Swish3 assumes that something like SWISH::Filter will be used to standardize documents before they are handed to Swish3.
One of the biggest changes is the configuration file format. Swish3 uses XML-style configuration files, and supports a subset of the configuration options available in Swish-e.
This section documents the configuration options supported in Swish3.