[seqfan] Re: help needed with Mediawiki's Lucene-search extension

Thu Dec 17 00:36:06 CET 2009

njas> Is anyone familiar with the internals of Mediawiki's Lucene-search extension?
njas> 
njas> It turns out that it doesn't handle phrase queries with commas, like
njas> "1, 2, 1, 2, 2, 3, 1".  Of course this is terrible for the OEIS!

If we look into
http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/src/org/wikimedia/lsearch/analyzers/FastWikiTokenizerEngine.java
we see in addToken() that they remove, for example, commas just before numbers.
So something like 2,3,4,5 would immediately be turned into 2345.
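
To make the effect concrete, here is a minimal Python sketch. It is not the
actual Java code from FastWikiTokenizerEngine.java: the function name and the
rule "drop a comma that directly precedes a digit" are my own simplification
of what addToken() appears to do.

```python
def drop_commas_before_digits(text):
    """Hypothetical simplification: delete every comma that is
    immediately followed by a digit and keep everything else."""
    out = []
    for i, ch in enumerate(text):
        next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
        if ch == "," and next_is_digit:
            continue  # comma is swallowed, so the neighbouring digits fuse
        out.append(ch)
    return "".join(out)


print(drop_commas_before_digits("2,3,4,5"))  # prints 2345
```

Once the terms are fused into a single token like this, a phrase query on the
individual terms has nothing left to match against.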

One might also try to remove the special meaning of commas by taking
the comma out of the list of tokens in the function isMinorBreak()
in the source code (same file, FastWikiTokenizerEngine.java).
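
What one would want for the OEIS is that each term of a sequence stays a
separate token, so that a quoted phrase matches consecutive terms. The Python
sketch below shows that desired outcome only. Both function names are
hypothetical, and I have not checked that changing isMinorBreak() would
actually produce this behaviour in the extension.

```python
import re


def tokenize_sequence(text):
    """Hypothetical tokenizer: commas and whitespace are plain separators,
    so every term becomes its own token."""
    return [t for t in re.split(r"[,\s]+", text) if t]


def contains_phrase(tokens, phrase):
    """True if the tokens of `phrase` occur consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


indexed = tokenize_sequence("0,1,2,1,2,2,3,1,5")
query = tokenize_sequence("1, 2, 1, 2, 2, 3, 1")
print(contains_phrase(indexed, query))  # prints True
```

With this kind of tokenization it does not matter whether the terms are
written as "1,2" or "1, 2", and a term such as 12 is never confused with the
two terms 1 and 2.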

This seems to be a hand-tailored parser (for speed reasons as one
of the explanatory pages says) that will be superseded anyway; I fear that
any changes at that low level would not survive...

The question is: is that Tokenizer actually called?

See
http://lucene.apache.org/java/2_4_1/queryparsersyntax.html
for an overview of the generic parser syntax (I am not sure whether this
is actually in use, or how much of it is overloaded...)

RJM