$VAR1 = {
'blog' => [
bless( {
'file' => '/home/karpet/blog/projects/swish/utf8.notes.pod',
'format' => 'pod',
'id' => 'utf8.notes.pod',
'mtime' => 1173884309,
'name' => 'utf8.notes',
'text' => '=head1 UTF-8 Research
Multibyte encoding support is one of the big Swish3 features. I had what I
thought was a workable framework for it using the C99 standard C<wchar_t> wide
character functions. But I\'ve been disillusioned (which is usually a Good Thing)
over the last couple of days, based on some reading I\'ve been doing on the linux-utf8 list
archives.
Namely this L<http://thread.gmane.org/gmane.comp.internationalization.linux/3758/focus=3769>
thread burst my bubble. But in a Good Way.
Seems C<wchar_t> is not portable, particularly on Windows (L<http://www.mail-archive.com/bug-autoconf@gnu.org/msg00648.html>), where it is 16 bits wide rather than the 32 bits needed
to hold every Unicode code point in a single unit.
So I\'ve been googling for other C libraries out there to help me. I need basically two kinds
of functions:
=over
=item C<utf8_tolower( xmlChar * mixedCaseStr )>
All the metanames and propertynames need to be normalized against parsed tagnames.
So I need to have this. Also, all strings are normalized to lowercase before tokenizing.
That\'s just an IR Good Practice.
=item C<tokenize_utf8_string( xmlChar * utf8_string )>
Split up a string into I<words>. This really should be language-aware, but at the very
least it needs to recognize what\'s alpha vs whitespace vs punctuation, etc.
=back
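As a rough illustration of the first kind of function, here is a minimal sketch that lowercases only the ASCII letters of a UTF-8 string in place. The name C<sketch_utf8_tolower_ascii> is hypothetical, and plain C<unsigned char> stands in for C<xmlChar>. It assumes valid UTF-8 input and does nothing for non-ASCII letters, which would need real Unicode case tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Swish3 or libxml2.
 * Lowercases A-Z in place and leaves every other byte alone.
 * Safe on valid UTF-8 because bytes below 0x80 never occur
 * inside a multibyte sequence. */
static void sketch_utf8_tolower_ascii(unsigned char *s)
{
    assert(s != NULL);
    for (; *s != 0; s++) {
        if (*s >= 0x41 && *s <= 0x5A) {          /* A..Z */
            *s = (unsigned char)(*s + 0x20);     /* a..z */
        }
    }
}
```

Bytes of 0x80 and above pass through untouched, so an accented capital stays capitalized. That is enough for ASCII tagnames but not for general text.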
Both types of functions are crucial to the kind of string wrangling Swish3 needs to do.
Along the way, it would be nice to build in a portable UTF-8-aware regular
expression library, since that would make for a nice flexible way of configuring
WordTokenPattern rather than needing to write a whole C function to tokenize
a string into words. This kind of regexp support is provided via general C regexp
in Swish-e, which I believe isn\'t UTF-8 aware, or optionally via PCRE (Perl Compatible
Regular Expressions), which (Dave tells me) isn\'t developed for Win32 anymore.
So it would be nice, though not crucial, to get UTF-8-happy regex support if I can
as well.
Google says many things about UTF-8 and i18n, but what it doesn\'t tell me is what I
should do. :)
Here are some things I\'ve found:
=over
=item L<http://www.geocities.jp/kosako3/oniguruma/>
UTF-8 regexp library. Actually supports lots of different encodings, which I don\'t really
need.
=item L<http://icu.sourceforge.net/apiref/>
The ICU from IBM. Big blue and formidable. Way More than I need. Not going to use this one.
=item L<http://www.haible.de/bruno/packages-libutf8.html>
Makes C<wchar_t> portable. Kind of. But the author is the one who disillusioned me in
that mail thread I mention above. So I\'m not going down that road anymore.
This library doesn\'t appear to be supported any more.
=item L<http://crl.nmsu.edu/~mleisher/ucdata.html>
Could be a starting point if I need to roll my own. Does UTF-8 to UCS-4 conversion,
so doesn\'t depend on C<wchar_t>. Has some UTF-8 functions, including a tolower().
=item L<http://xmlsoft.org/html/libxml-xmlunicode.html>
As Bill reminded me, libxml2 has Unicode stuff in it too. Not exactly what I need, but
could be a starting place. And it fits with my increasing sense of using libxml2 to
do everything.
=back
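For a sense of what rolling my own conversion would involve, here is a minimal sketch of decoding one UTF-8 sequence into a 32-bit code point without touching C<wchar_t>. The function name C<sketch_utf8_decode> and its signature are hypothetical and are not taken from any of the libraries listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s into *cp.
 * Returns the number of bytes consumed (1 to 4), or 0 if the bytes are
 * not a well-formed sequence. Uses uint32_t, so it does not depend on
 * the platform width of wchar_t. */
static size_t sketch_utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }   /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;   /* stray continuation byte or invalid lead byte */

    if (len < need) return 0;                    /* truncated input */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;     /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, UTF-16 surrogates, and values past U+10FFFF */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return need;
}
```

Once text is in code points, case mapping and character classification become table lookups keyed on a 32-bit integer, which is where a library with Unicode data earns its keep.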
=head2 What\'s in a Word?
Word tokenization is the big issue. Swish-e tokenizes through a 256-byte lookup table: it\'s
perfect for 8-bit encodings because it is fast and easy to understand. But the Hawker Observation applies:
you need to seriously rethink the algorithm you\'re using every time you increase your data set by two orders of magnitude. (L<http://alumnus.caltech.edu/~copeland/work/i18n-b.html>)
That\'s why a regexp library or something else with predefined Unicode character tables
is necessary. It\'s a wheel that\'s been invented. What I\'m struggling with tonight is I<which>
wheel to use: what\'s easy to implement, well supported and proven, and will be something
I can use now and trust a year from now.
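To make the lookup-table approach concrete, here is a minimal sketch of a 256-entry table tokenizer. It is not the Swish-e source: the names C<sketch_wordchar>, C<sketch_init_table> and C<sketch_next_word> are hypothetical, and the table is assumed to mark only ASCII letters and digits as word characters.

```c
#include <assert.h>
#include <stddef.h>

static unsigned char sketch_wordchar[256];   /* 1 = byte is part of a word */

/* Mark ASCII digits and letters as word characters (hypothetical defaults). */
static void sketch_init_table(void)
{
    int c;
    for (c = 0; c < 256; c++) sketch_wordchar[c] = 0;
    for (c = 0x30; c <= 0x39; c++) sketch_wordchar[c] = 1;   /* 0-9 */
    for (c = 0x41; c <= 0x5A; c++) sketch_wordchar[c] = 1;   /* A-Z */
    for (c = 0x61; c <= 0x7A; c++) sketch_wordchar[c] = 1;   /* a-z */
}

/* Find the next word at or after *pos. Returns its length and stores
 * its start offset in *start; returns 0 when no words remain. */
static size_t sketch_next_word(const unsigned char *s, size_t len,
                               size_t *pos, size_t *start)
{
    size_t i;
    assert(s != NULL && pos != NULL && start != NULL);
    i = *pos;
    while (i < len && !sketch_wordchar[s[i]]) i++;   /* skip separators */
    *start = i;
    while (i < len && sketch_wordchar[s[i]]) i++;    /* consume the word */
    *pos = i;
    return i - *start;
}
```

One byte indexes one table slot, which is why this works for 8-bit encodings and breaks down for UTF-8: a two-byte character such as U+00E9 (bytes C3 A9) is seen as two unrelated bytes, and with this table it splits the word around it.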
',
'url' => 'projects/swish/utf8.notes'
}, 'PodBlog::Model::Blog::Entry' )
],
'menu' => [
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/books',
'level' => 1,
'name' => 'books',
'url' => 'books'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/general',
'level' => 1,
'name' => 'general',
'url' => 'general'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/projects',
'level' => 1,
'name' => 'projects',
'url' => 'projects'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/_intro.txt',
'level' => 2,
'name' => '_intro',
'url' => 'projects/_intro'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ajax.txt',
'level' => 2,
'name' => 'ajax',
'url' => 'projects/ajax'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/blas.txt',
'level' => 2,
'name' => 'blas',
'url' => 'projects/blas'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/blue.txt',
'level' => 2,
'name' => 'blue',
'url' => 'projects/blue'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/bug_or_feature.txt',
'level' => 2,
'name' => 'bug_or_feature',
'url' => 'projects/bug_or_feature'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/catalyst.txt',
'level' => 2,
'name' => 'catalyst',
'url' => 'projects/catalyst'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/chars.txt',
'level' => 2,
'name' => 'chars',
'url' => 'projects/chars'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/cpan.txt',
'level' => 2,
'name' => 'cpan',
'url' => 'projects/cpan'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/craydoc.txt',
'level' => 2,
'name' => 'craydoc',
'url' => 'projects/craydoc'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/cssprint.txt',
'level' => 2,
'name' => 'cssprint',
'url' => 'projects/cssprint'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talk1.txt',
'level' => 2,
'name' => 'fp_talk1',
'url' => 'projects/fp_talk1'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talk2.txt',
'level' => 2,
'name' => 'fp_talk2',
'url' => 'projects/fp_talk2'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talk3.txt',
'level' => 2,
'name' => 'fp_talk3',
'url' => 'projects/fp_talk3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/fp_talks.txt',
'level' => 2,
'name' => 'fp_talks',
'url' => 'projects/fp_talks'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/frozenperl.txt',
'level' => 2,
'name' => 'frozenperl',
'url' => 'projects/frozenperl'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/hacker.txt',
'level' => 2,
'name' => 'hacker',
'url' => 'projects/hacker'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/hiliter.txt',
'level' => 2,
'name' => 'hiliter',
'url' => 'projects/hiliter'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/http_flow.txt',
'level' => 2,
'name' => 'http_flow',
'url' => 'projects/http_flow'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ibmunicode.txt',
'level' => 2,
'name' => 'ibmunicode',
'url' => 'projects/ibmunicode'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ideas.txt',
'level' => 2,
'name' => 'ideas',
'url' => 'projects/ideas'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ishida.txt',
'level' => 2,
'name' => 'ishida',
'url' => 'projects/ishida'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/iterm.txt',
'level' => 2,
'name' => 'iterm',
'url' => 'projects/iterm'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/larry_pm.txt',
'level' => 2,
'name' => 'larry_pm',
'url' => 'projects/larry_pm'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/latenights.txt',
'level' => 2,
'name' => 'latenights',
'url' => 'projects/latenights'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/long_live_perl.txt',
'level' => 2,
'name' => 'long_live_perl',
'url' => 'projects/long_live_perl'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/memory.txt',
'level' => 2,
'name' => 'memory',
'url' => 'projects/memory'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/mylibrary.txt',
'level' => 2,
'name' => 'mylibrary',
'url' => 'projects/mylibrary'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/perlisalive.txt',
'level' => 2,
'name' => 'perlisalive',
'url' => 'projects/perlisalive'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/postgresql_on_osx.txt',
'level' => 2,
'name' => 'postgresql_on_osx',
'url' => 'projects/postgresql_on_osx'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/profiling_perl.txt',
'level' => 2,
'name' => 'profiling_perl',
'url' => 'projects/profiling_perl'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/rest.txt',
'level' => 2,
'name' => 'rest',
'url' => 'projects/rest'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/ror.txt',
'level' => 2,
'name' => 'ror',
'url' => 'projects/ror'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/stateofsearch.txt',
'level' => 2,
'name' => 'stateofsearch',
'url' => 'projects/stateofsearch'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/projects/swish',
'level' => 2,
'name' => 'swish',
'url' => 'projects/swish'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/projects/swish/api_docs',
'level' => 3,
'name' => 'api_docs',
'url' => 'projects/swish/api_docs'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/bindings.pod',
'level' => 3,
'name' => 'bindings',
'url' => 'projects/swish/bindings'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/cpan100606.pod',
'level' => 3,
'name' => 'cpan100606',
'url' => 'projects/swish/cpan100606'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/libswish3.pod',
'level' => 3,
'name' => 'libswish3',
'url' => 'projects/swish/libswish3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/original_idea.txt',
'level' => 3,
'name' => 'original_idea',
'url' => 'projects/swish/original_idea'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/progress.txt',
'level' => 3,
'name' => 'progress',
'url' => 'projects/swish/progress'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/progress2.txt',
'level' => 3,
'name' => 'progress2',
'url' => 'projects/swish/progress2'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/progress3.txt',
'level' => 3,
'name' => 'progress3',
'url' => 'projects/swish/progress3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/swish3Proposal.pod',
'level' => 3,
'name' => 'swish3Proposal',
'url' => 'projects/swish/swish3Proposal'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/swishprog.pod',
'level' => 3,
'name' => 'swishprog',
'url' => 'projects/swish/swishprog'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/swishprog2.pod',
'level' => 3,
'name' => 'swishprog2',
'url' => 'projects/swish/swishprog2'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/tokenizer.txt',
'level' => 3,
'name' => 'tokenizer',
'url' => 'projects/swish/tokenizer'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/tools.txt',
'level' => 3,
'name' => 'tools',
'url' => 'projects/swish/tools'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/utf8.notes.pod',
'level' => 3,
'name' => 'utf8.notes',
'url' => 'projects/swish/utf8.notes'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/whySwish3.pod',
'level' => 3,
'name' => 'whySwish3',
'url' => 'projects/swish/whySwish3'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swish/xapian10.txt',
'level' => 3,
'name' => 'xapian10',
'url' => 'projects/swish/xapian10'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/swished.txt',
'level' => 2,
'name' => 'swished',
'url' => 'projects/swished'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/texttools.txt',
'level' => 2,
'name' => 'texttools',
'url' => 'projects/texttools'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/wrong.txt',
'level' => 2,
'name' => 'wrong',
'url' => 'projects/wrong'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/xapian.txt',
'level' => 2,
'name' => 'xapian',
'url' => 'projects/xapian'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 0,
'file' => '/home/karpet/blog/projects/yamllint.txt',
'level' => 2,
'name' => 'yamllint',
'url' => 'projects/yamllint'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/spam',
'level' => 1,
'name' => 'spam',
'url' => 'spam'
}, 'PodBlog::Model::Menu::Entry' ),
bless( {
'dir' => 1,
'file' => '/home/karpet/blog/stpaulbartour',
'level' => 1,
'name' => 'stpaulbartour',
'url' => 'stpaulbartour'
}, 'PodBlog::Model::Menu::Entry' )
]
};