$VAR1 = {
'blog' => [
bless( {
'file' => '/home/karpet/blog/projects/swish/api_docs/swish_intro.7.pod',
'format' => 'pod',
'id' => 'swish_intro.7.pod',
'mtime' => 1208317559,
'name' => 'swish_intro.7',
'text' => '=pod
=head1 Introduction to Swish3
Swish is the Simple Web Indexing System for Humans. Swish
is an information retrieval tool. It is B<not> a search engine, but
can be used as an integral part of creating a search engine. Swish gathers,
parses, indexes and searches document collections. A collection can be any
set of real or virtual documents: web pages, database rows, PDFs or office
files, or anything else that can be converted to text.
Swish3 is version three of Swish.
Kevin Hughes wrote the original version in 1994. In 2000, the project was
updated and released as Swish-e version 2 (the -e is for Enhanced). Swish3
is the third phase in the evolution of the project.
In this document, the name C<Swish> will refer to the entire project,
without regard to a particular version. C<Swish-e> will refer specifically
to version 2.x. C<Swish3> will refer specifically to version three.
=head2 Anatomy of a Search Tool Chain
The following description could apply to any search system or information
retrieval project, not just Swish. First we\'ll look at the various
parts of the system, then at how they are implemented in Swish3.
Every search system implements the following chain of features:
=over
=item aggregator
An aggregator assembles documents into a collection. It can be as simple
as a filesystem tool like the Unix B<find> command or as sophisticated as a
web crawler. An aggregator selects documents based on criteria such as
content, MIME type or format, date, author, URL, or anything else
you choose.
=item normalizer
A normalizer ensures that every document the aggregator collects is in a format
that the analyzer can parse. For example, a binary file format like PDF is
converted to HTML or plain text. The same is true for office file formats,
PostScript, and so on.
=item analyzer
An analyzer examines the text supplied via the aggregator/normalizer steps.
The analyzer does several things, some of them optional:
=over
=item parsing
Separates text from any surrounding markup, optionally
remembering the context (tag) in which text was found.
=item case folding
Changes the text to all lowercase or all uppercase, to make comparisons
easier.
=item tokenizing
Splits a string of text into tokens or words.
=item stemming
Tries to discover the root C<stem> of each word, using one of a variety of
word-stemming algorithms.
=item customization
Many advanced analyzers offer some level of customization to apply at some
point in the analysis, whether it be synonym matching or other linguistic
logic.
=back
=item indexer
An indexer stores basic document metadata and token (word) information
in an index for fast and efficient retrieval.
=item searcher
A searcher parses a user query using the same logic used by the analyzer
when processing the original document collection,
applies some well-defined rules for matching documents in the index,
and then returns results, typically a list or iterator of matching documents.
=back
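The five links above can be sketched in a few lines of Python. This is a toy
illustration, not Swish3 code: the names C<analyze>, C<build_index> and
C<search> are made up for this example, the stemmer only strips a trailing
C<s> as a stand-in for a real stemming algorithm, and the aggregator and
normalizer are reduced to a hard-coded dictionary of HTML strings.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Parsing step: collect the text found between markup tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def analyze(markup):
    """Parse, case-fold, tokenize and crudely stem one document or query."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    text = " ".join(extractor.chunks).lower()   # case folding
    tokens = re.findall(r"[a-z0-9]+", text)     # tokenizing
    # Toy stemmer (assumption): strip one trailing "s" from longer words.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_index(docs):
    """Indexer: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, markup in docs.items():
        for token in analyze(markup):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Searcher: analyze the query like a document, then AND the terms."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Stand-in for the aggregator/normalizer output: already-collected HTML.
docs = {
    "a.html": "<html><body><h1>Indexing Tools</h1><p>Swish parses documents.</p></body></html>",
    "b.html": "<p>A web crawler collects pages.</p>",
}
index = build_index(docs)
print(search(index, "Document"))
print(search(index, "crawler PAGES"))
```

Because C<search> runs the query through the same C<analyze> function as the
documents, the query C<Document> matches the indexed word C<documents> in
C<a.html>, regardless of case or the plural ending.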
Now let\'s look at how Swish3 implements these five features.
=head2 A Library, Not a Command
The first thing to know about Swish3 is that, unlike previous versions of
Swish, there is no single Swish3 implementation.
That might sound confusing at first, because it is a significant
departure from earlier versions of Swish, where there was a primary
program, written in C, which handled all five links in the search chain.
Swish3 takes a different approach.
Swish3 is primarily a C library called B<libswish3>. The library has a
well-defined list of public functions and data structures that aim
to fill a particular void in the world of information retrieval tools:
analyzing HTML and XML documents.
Swish3 takes as its starting point the B<-S prog> feature of Swish-e,
where you can define your own aggregator/normalizer program, and makes that
Swish3\'s central feature. Swish3 extends the B<-S prog> API to include
additional header values, and adds the same MIME-type-matching feature
as the popular Apache web server.
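As a rough sketch of what an aggregator/normalizer program emits, the Python
below builds one document record as header lines, a blank line, and the
document body. The header names C<Path-Name>, C<Content-Length> and
C<Last-Mtime> follow the Swish-e 2.x B<-S prog> convention as best recalled
here; the additional Swish3 headers are not shown because this document does
not name them. The extension-to-MIME lookup uses the Python standard
C<mimetypes> module purely as an analogy for Apache-style matching, not as a
description of how libswish3 does it. The file name and function names are
hypothetical.

```python
import mimetypes


def guess_mime(path, default="application/octet-stream"):
    """Map a file name to a MIME type by its extension."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or default


def prog_record(path, body, mtime):
    """Build one record: header lines, a blank line, then the body bytes."""
    data = body.encode("utf-8")
    headers = [
        "Path-Name: " + path,
        "Content-Length: " + str(len(data)),  # length in bytes, not characters
        "Last-Mtime: " + str(mtime),
    ]
    return ("\n".join(headers) + "\n\n").encode("utf-8") + data


# Hypothetical document; the mtime is an arbitrary Unix timestamp.
record = prog_record("docs/intro.html", "<p>Hello</p>", 1208317559)
print(record.decode("utf-8"))
print(guess_mime("docs/intro.html"))
```

A program that wrote such records to stdout for every document it collected
would be playing the aggregator/normalizer role described above.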
Swish3 has no native indexer or searcher features [TODO: this might change
if the 2.6 BDB backend is ported]. Nor does it have any aggregator or normalizer
features. Swish3 is primarily an analyzer.
The Swish3 distribution does come with some examples of how to write Swish3
applications, including an example program for using the popular Xapian
library. There is also a Perl implementation based on the SWISH::Prog package.
=head2 So How Does It Work?
libswish3 defines hooks or callbacks where you can override the default
behavior of the analyzer. These hooks are intended to make it easy to
plug libswish3 into the indexing chain.
Here\'s one example. If you wanted to index a web site, you might use an
aggregator/normalizer tool like Swish-e\'s B<spider.pl>, which prints its
output to stdout:
  % spider.pl your_config > spider_output
Then you could use a program like B<swish_xapian> to analyze and index the
output:
  % swish_xapian -c swish.conf - < spider_output
If you look at the source for the B<swish_xapian> program in
the libswish3 distribution, you will see that it defines a B<handler> function
that takes the output of the libswish3 parsing function and
adds it to a Xapian index.
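The hook idea can be illustrated without the C library. In the Python sketch
below, C<parse_stream> plays the role of the library parsing function and
C<handler> is the application-supplied callback that stores each parsed
document. Every name here is hypothetical, a plain dictionary stands in for
the Xapian index, and the record layout (header lines, blank line, body) is an
assumed simplification of the spider output.

```python
import io


def read_records(stream):
    """Yield (headers, body) pairs from a header-plus-body byte stream."""
    while True:
        headers = {}
        line = stream.readline()
        if not line:
            return                      # end of input
        while line.strip():
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body.decode("utf-8")


def parse_stream(stream, handler):
    """Stand-in for the library: analyze each record, then call the hook."""
    for headers, body in read_records(stream):
        tokens = body.lower().split()   # trivial analysis for illustration
        handler(headers["Path-Name"], tokens)


index = {}


def handler(path, tokens):
    """Application-supplied hook: store the parsed tokens in our own index."""
    for token in tokens:
        index.setdefault(token, set()).add(path)


# Simulated aggregator output: two records back to back.
spider_output = io.BytesIO(
    b"Path-Name: a.txt\nContent-Length: 11\n\nhello world"
    b"Path-Name: b.txt\nContent-Length: 5\n\nhello"
)
parse_stream(spider_output, handler)
print(sorted(index["hello"]))
```

The design point is that the parsing side never needs to know what kind of
index it feeds: swapping the dictionary for another backend only means
supplying a different C<handler>.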
=head2 See Also
This document provides an overview of Swish3\'s anatomy. You might also be
interested in these docs:
=over
=item
L<Migrating from Swish-e to Swish3|swish_migration.7>
=item
L<Perl implementation of Swish3|SWISH::Prog>
=item
L<libswish3 API|libswish3.3>
=back
',
'url' => 'projects/swish/api_docs/swish_intro.7'
}, 'PodBlog::Model::Blog::Entry' )
],
'menu' => [
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/books',
'level' => 1,
'name' => 'books',
'url' => 'books'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/general',
'level' => 1,
'name' => 'general',
'url' => 'general'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/projects',
'level' => 1,
'name' => 'projects',
'url' => 'projects'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/_intro.txt',
'level' => 2,
'name' => '_intro',
'url' => 'projects/_intro'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ajax.txt',
'level' => 2,
'name' => 'ajax',
'url' => 'projects/ajax'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/blas.txt',
'level' => 2,
'name' => 'blas',
'url' => 'projects/blas'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/blue.txt',
'level' => 2,
'name' => 'blue',
'url' => 'projects/blue'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/bug_or_feature.txt',
'level' => 2,
'name' => 'bug_or_feature',
'url' => 'projects/bug_or_feature'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/catalyst.txt',
'level' => 2,
'name' => 'catalyst',
'url' => 'projects/catalyst'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/chars.txt',
'level' => 2,
'name' => 'chars',
'url' => 'projects/chars'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/cpan.txt',
'level' => 2,
'name' => 'cpan',
'url' => 'projects/cpan'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/craydoc.txt',
'level' => 2,
'name' => 'craydoc',
'url' => 'projects/craydoc'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/cssprint.txt',
'level' => 2,
'name' => 'cssprint',
'url' => 'projects/cssprint'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talk1.txt',
'level' => 2,
'name' => 'fp_talk1',
'url' => 'projects/fp_talk1'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talk2.txt',
'level' => 2,
'name' => 'fp_talk2',
'url' => 'projects/fp_talk2'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talk3.txt',
'level' => 2,
'name' => 'fp_talk3',
'url' => 'projects/fp_talk3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talks.txt',
'level' => 2,
'name' => 'fp_talks',
'url' => 'projects/fp_talks'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/frozenperl.txt',
'level' => 2,
'name' => 'frozenperl',
'url' => 'projects/frozenperl'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/hacker.txt',
'level' => 2,
'name' => 'hacker',
'url' => 'projects/hacker'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/hiliter.txt',
'level' => 2,
'name' => 'hiliter',
'url' => 'projects/hiliter'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/http_flow.txt',
'level' => 2,
'name' => 'http_flow',
'url' => 'projects/http_flow'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ibmunicode.txt',
'level' => 2,
'name' => 'ibmunicode',
'url' => 'projects/ibmunicode'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ideas.txt',
'level' => 2,
'name' => 'ideas',
'url' => 'projects/ideas'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ishida.txt',
'level' => 2,
'name' => 'ishida',
'url' => 'projects/ishida'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/iterm.txt',
'level' => 2,
'name' => 'iterm',
'url' => 'projects/iterm'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/larry_pm.txt',
'level' => 2,
'name' => 'larry_pm',
'url' => 'projects/larry_pm'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/latenights.txt',
'level' => 2,
'name' => 'latenights',
'url' => 'projects/latenights'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/long_live_perl.txt',
'level' => 2,
'name' => 'long_live_perl',
'url' => 'projects/long_live_perl'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/memory.txt',
'level' => 2,
'name' => 'memory',
'url' => 'projects/memory'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/mylibrary.txt',
'level' => 2,
'name' => 'mylibrary',
'url' => 'projects/mylibrary'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/perlisalive.txt',
'level' => 2,
'name' => 'perlisalive',
'url' => 'projects/perlisalive'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/postgresql_on_osx.txt',
'level' => 2,
'name' => 'postgresql_on_osx',
'url' => 'projects/postgresql_on_osx'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/profiling_perl.txt',
'level' => 2,
'name' => 'profiling_perl',
'url' => 'projects/profiling_perl'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/rest.txt',
'level' => 2,
'name' => 'rest',
'url' => 'projects/rest'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ror.txt',
'level' => 2,
'name' => 'ror',
'url' => 'projects/ror'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/stateofsearch.txt',
'level' => 2,
'name' => 'stateofsearch',
'url' => 'projects/stateofsearch'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/projects/swish',
'level' => 2,
'name' => 'swish',
'url' => 'projects/swish'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/projects/swish/api_docs',
'level' => 3,
'name' => 'api_docs',
'url' => 'projects/swish/api_docs'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/api_docs/libswish3.3.pod',
'level' => 4,
'name' => 'libswish3.3',
'url' => 'projects/swish/api_docs/libswish3.3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/api_docs/swish_intro.7.pod',
'level' => 4,
'name' => 'swish_intro.7',
'url' => 'projects/swish/api_docs/swish_intro.7'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/api_docs/swish_isw.1.pod',
'level' => 4,
'name' => 'swish_isw.1',
'url' => 'projects/swish/api_docs/swish_isw.1'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/api_docs/swish_lint.1.pod',
'level' => 4,
'name' => 'swish_lint.1',
'url' => 'projects/swish/api_docs/swish_lint.1'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/api_docs/swish_migration.7.pod',
'level' => 4,
'name' => 'swish_migration.7',
'url' => 'projects/swish/api_docs/swish_migration.7'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/api_docs/swish_words.1.pod',
'level' => 4,
'name' => 'swish_words.1',
'url' => 'projects/swish/api_docs/swish_words.1'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/bindings.pod',
'level' => 3,
'name' => 'bindings',
'url' => 'projects/swish/bindings'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/cpan100606.pod',
'level' => 3,
'name' => 'cpan100606',
'url' => 'projects/swish/cpan100606'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/libswish3.pod',
'level' => 3,
'name' => 'libswish3',
'url' => 'projects/swish/libswish3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/original_idea.txt',
'level' => 3,
'name' => 'original_idea',
'url' => 'projects/swish/original_idea'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/progress.txt',
'level' => 3,
'name' => 'progress',
'url' => 'projects/swish/progress'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/progress2.txt',
'level' => 3,
'name' => 'progress2',
'url' => 'projects/swish/progress2'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/progress3.txt',
'level' => 3,
'name' => 'progress3',
'url' => 'projects/swish/progress3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/swish3Proposal.pod',
'level' => 3,
'name' => 'swish3Proposal',
'url' => 'projects/swish/swish3Proposal'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/swishprog.pod',
'level' => 3,
'name' => 'swishprog',
'url' => 'projects/swish/swishprog'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/swishprog2.pod',
'level' => 3,
'name' => 'swishprog2',
'url' => 'projects/swish/swishprog2'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/tokenizer.txt',
'level' => 3,
'name' => 'tokenizer',
'url' => 'projects/swish/tokenizer'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/tools.txt',
'level' => 3,
'name' => 'tools',
'url' => 'projects/swish/tools'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/utf8.notes.pod',
'level' => 3,
'name' => 'utf8.notes',
'url' => 'projects/swish/utf8.notes'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/whySwish3.pod',
'level' => 3,
'name' => 'whySwish3',
'url' => 'projects/swish/whySwish3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/xapian10.txt',
'level' => 3,
'name' => 'xapian10',
'url' => 'projects/swish/xapian10'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swished.txt',
'level' => 2,
'name' => 'swished',
'url' => 'projects/swished'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/texttools.txt',
'level' => 2,
'name' => 'texttools',
'url' => 'projects/texttools'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/wrong.txt',
'level' => 2,
'name' => 'wrong',
'url' => 'projects/wrong'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/xapian.txt',
'level' => 2,
'name' => 'xapian',
'url' => 'projects/xapian'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/yamllint.txt',
'level' => 2,
'name' => 'yamllint',
'url' => 'projects/yamllint'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/spam',
'level' => 1,
'name' => 'spam',
'url' => 'spam'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/stpaulbartour',
'level' => 1,
'name' => 'stpaulbartour',
'url' => 'stpaulbartour'
}, 'PodBlog::Model::Menu::Entry' )
]
};