[seqfan] Re: help needed with Mediawiki's Lucene-search extension

Thu Dec 17 05:28:00 CET 2009

Another possibility is to adapt the current classic search engine to this wiki
as a sequence search box, keeping the standard MediaWiki search engine
for free-text search.

The current classic search was remarkably well optimized for the OEIS
format.  On the input side, we can probably make sure it can read and
index a disk-based version of the sequence pages, produced from the
sequence pages of the wiki. On the output side, we can adapt it to
produce links to the new wiki and snippets of the matching sequences.
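
To make the input side a bit more concrete, here is a minimal Python
sketch of sequence-aware matching. It is not the code of the classic
search; the names parse_terms, contains_run and collapse_commas and the
sample data are made up for illustration. collapse_commas only mimics
the behaviour described in the quoted message below, where commas in
front of digits are dropped.

```python
import re

def collapse_commas(text):
    """Mimic the reported behaviour: drop commas that sit just before a digit."""
    return re.sub(r",\s*(?=\d)", "", text)

def parse_terms(query):
    """Split a query such as '1, 2, 1, 2' into a list of integers."""
    return [int(t) for t in re.split(r"[,\s]+", query.strip()) if t]

def contains_run(terms, run):
    """True if `run` occurs as a contiguous block inside `terms`."""
    n = len(run)
    return n > 0 and any(terms[i:i + n] == run
                         for i in range(len(terms) - n + 1))

# Hypothetical sample data, not real OEIS entries.
sequences = {
    "X1": [1, 1, 2, 1, 2, 2, 3, 1, 2],
    "X2": [12, 12, 23, 1],
}

query = "1, 2, 1, 2, 2, 3"
hits = [name for name, terms in sequences.items()
        if contains_run(terms, parse_terms(query))]

print(collapse_commas("2,3,4,5"))  # 2345
print(hits)                        # ['X1']
```

Keeping the terms as a list of integers means that 1, 2 can never be
confused with 12, which is exactly what is lost once the commas are
removed.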

For the presentation, we could split the search result page: at the top,
the first ten or so matching sequences, and at the bottom, the wiki search
engine results for non-sequence pages, with links to expanded result pages, etc.
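
A rough Python sketch of such a split page, in text form. The function
name render_results, the section titles and the limit of ten are only
assumptions for the sake of the example.

```python
def render_results(sequence_hits, wiki_hits, limit=10):
    """Return a text page: top sequence matches first, wiki pages below."""
    lines = ["Sequences:"]
    lines += [f"  {hit}" for hit in sequence_hits[:limit]]
    if len(sequence_hits) > limit:
        # Stands in for the link to an expanded results page.
        lines.append(f"  ... {len(sequence_hits) - limit} more sequences")
    lines.append("Other wiki pages:")
    lines += [f"  {hit}" for hit in wiki_hits]
    return "\n".join(lines)

# Hypothetical hits, only to show the layout.
page = render_results([f"S{i}" for i in range(12)], ["Help page"])
print(page)
```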

What I would like is for us to keep the possibility, for people like
me, of receiving the results in text-only format, with thematic
operators such as author: etc., and with a well-defined format and
precise error messages.
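
As an illustration of what I mean by operators and precise error
messages, here is a small Python sketch. Only author: comes from the
existing search; the other operator names, the QueryError class and the
wording of the messages are assumptions.

```python
KNOWN_OPERATORS = {"author", "keyword", "id"}  # assumed operator names

class QueryError(ValueError):
    """Raised with a message that names the offending token."""

def parse_query(query):
    """Split a query into ({operator: [values]}, [free-text words])."""
    fields, free = {}, []
    for pos, token in enumerate(query.split(), start=1):
        if ":" in token:
            op, _, value = token.partition(":")
            if op not in KNOWN_OPERATORS:
                raise QueryError(f"token {pos}: unknown operator '{op}:'")
            if not value:
                raise QueryError(f"token {pos}: operator '{op}:' has no value")
            fields.setdefault(op, []).append(value)
        else:
            free.append(token)
    return fields, free

print(parse_query("author:Sloane 1,2,3"))
try:
    parse_query("1,2,3 autor:Sloane")
except QueryError as err:
    print("error:", err)  # error: token 2: unknown operator 'autor:'
```

A script reading the text output can then tell a malformed query from an
empty result, because the error says which token was rejected and why.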

Olivier

On Thu, Dec 17, 2009 at 02:36, Richard Mathar <mathar at strw.leidenuniv.nl> wrote:
>
> njas> Is anyone familiar with the internals of Mediawiki's Lucene-search extension?
> njas>
> njas> It turns out that it doesn't handle phrase queries with commas, like
> njas> "1, 2, 1, 2, 2, 3, 1".  Of course this is terrible for the OEIS!
>
> If we look into
> http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/src/org/wikimedia/lsearch/analyzers/FastWikiTokenizerEngine.java
> we see in addToken() that they remove for example commas just before numbers.
> So a thing like 2,3,4,5 would immediately be turned into 2345.
>
> One might also try to remove the special meaning of commas by taking
> the comma out of the list of tokens in the function isMinorBreak()
> in the source code (same FastWikiTokenizerEngine.java).
>
> This seems to be a hand-tailored parser (for speed reasons as one
> of the explanatory pages says) that will be superseded anyway; I fear that
> any changes at that low level would not survive...
>
> The question is: is that Tokenizer actually called?
>
> See
> http://lucene.apache.org/java/2_4_1/queryparsersyntax.html
> for an overview of the generic parser syntax (I am not sure whether this
> is actually in use, or how much of it is overloaded...)
>
> RJM
>
>
> _______________________________________________
>
> Seqfan Mailing list - http://list.seqfan.eu/
>