[seqfan] Re: help needed with Mediawiki's Lucene-search extension

Charles Greathouse charles.greathouse at case.edu
Thu Dec 17 16:58:27 CET 2009


I also favor using the current search engine, at least as a first
pass.  It provides more functionality than the general MediaWiki
search.  I'd be willing to look at the internals if someone were
interested in combining the two search engines.  But I'm not familiar
with the code for either the current OEIS search or the Lucene search,
so it could take some time!

Charles Greathouse
Analyst/Programmer
Case Western Reserve University

On Wed, Dec 16, 2009 at 11:28 PM, Olivier Gerard
<olivier.gerard at gmail.com> wrote:
> Another possibility is to adapt the current classic search engine to
> this wiki as a sequence search box, keeping the classical MediaWiki
> search engine as a free-text search.
>
> The current classic search was remarkably well optimized for the
> OEIS format.  On the input side, we can probably make sure it can
> read/index a disk-based version of the sequence pages produced from
> the wiki.  On the output side, we can adapt it to produce links and
> snippets of sequences pointing into the new wiki.
>
> And for the presentation we could split the search result page: the
> first ten or so matching sequences at the top, and the wiki search
> engine's results on non-sequence pages at the bottom, with links to
> expanded result pages, etc.
>
> What I would like is that we keep the possibility, for those like
> me, of receiving the results in text-only format, with thematic
> operators such as author:, a well-defined output format, and precise
> error messages.
>
>
> Olivier
>
>
>
> On Thu, Dec 17, 2009 at 02:36, Richard Mathar <mathar at strw.leidenuniv.nl> wrote:
>>
>> njas> Is anyone familiar with the internals of Mediawiki's Lucene-search extension?
>> njas>
>> njas> It turns out that it doesn't handle phrase queries with commas, like
>> njas> "1, 2, 1, 2, 2, 3, 1".  Of course this is terrible for the OEIS!
>>
>> If we look into
>> http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/src/org/wikimedia/lsearch/analyzers/FastWikiTokenizerEngine.java
>> we see in addToken() that it removes, for example, commas that come
>> just before numbers, so a string like 2,3,4,5 is immediately turned
>> into 2345.
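>>
>> As a rough illustration (a minimal sketch of the behaviour, not the
>> extension's actual code; the method name stripCommasBeforeDigits is
>> my own):
>>
>>     // Drop any comma that sits directly in front of a digit,
>>     // gluing the surrounding numbers together.
>>     static String stripCommasBeforeDigits(String input) {
>>         StringBuilder out = new StringBuilder();
>>         for (int i = 0; i < input.length(); i++) {
>>             char c = input.charAt(i);
>>             if (c == ',' && i + 1 < input.length()
>>                     && Character.isDigit(input.charAt(i + 1))) {
>>                 continue;  // the comma simply vanishes
>>             }
>>             out.append(c);
>>         }
>>         return out.toString();  // "2,3,4,5" -> "2345"
>>     }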
>>
>> One might also try to remove the special meaning of commas by taking
>> the comma out of the list of break characters in the function
>> isMinorBreak() (in the same FastWikiTokenizerEngine.java).
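>>
>> A hypothetical sketch of that change (I have not checked the real
>> method body, so everything beyond the name isMinorBreak() is my
>> guess):
>>
>>     private boolean isMinorBreak(char c) {
>>         return c == '/' || c == '.'
>>             // || c == ','  // removed: commas no longer break tokens
>>             ;
>>     }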
>>
>> This seems to be a hand-tailored parser (for speed, as one of the
>> explanatory pages says) that will be superseded anyway; I fear that
>> any changes at that low level would not survive...
>>
>> The question is: is that Tokenizer actually called?
>>
>> See
>> http://lucene.apache.org/java/2_4_1/queryparsersyntax.html
>> for an overview of the generic parser syntax (though I am not sure
>> whether it is actually in use, or how much of it is overridden...)
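>>
>> For instance, with the stock Lucene query parser it is the analyzer
>> that decides whether commas survive a phrase query at all.  A sketch
>> against the Lucene 2.4 API (the field name "contents" is my
>> assumption, and the index would have to be built with the same
>> analyzer):
>>
>>     import org.apache.lucene.analysis.WhitespaceAnalyzer;
>>     import org.apache.lucene.queryParser.ParseException;
>>     import org.apache.lucene.queryParser.QueryParser;
>>     import org.apache.lucene.search.Query;
>>
>>     public class PhraseDemo {
>>         public static void main(String[] args) throws ParseException {
>>             // WhitespaceAnalyzer splits on whitespace only, so the
>>             // tokens "1," "2," ... keep their commas; the default
>>             // StandardAnalyzer would throw them away.
>>             QueryParser parser =
>>                 new QueryParser("contents", new WhitespaceAnalyzer());
>>             Query q = parser.parse("\"1, 2, 1, 2, 2, 3, 1\"");
>>             System.out.println(q);
>>         }
>>     }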
>>
>> RJM