[seqfan] A metric and UI for Pandora-like OEIS exploration

Robert Munafo mrob27 at gmail.com
Sun Oct 25 23:18:16 CET 2009


I have begun working on this now.

The overall structure of the implementation is the following:

  A *database* of items (sequences in our case)

  A *metric* algorithm for measuring the cognitive distance between any two
items

  A user interface for
     creating a *station* (corresponding to a particular area of interest,
or a field of research)
     auto-*recommending* items to the user
     User-*rating* items as "thumbs up", "thumbs down", or no comment
(ambivalent)

The implementation I am creating consists of a command-line interface with a
compiled tool that implements the metric. It also uses a local copy of the
database (in the form of eisBTfry00000.txt files, which are no longer
available, but I have a 2-year old version I downloaded in 2007).

*The Metric*

The "metric" is the algorithm for determining how closely related two
sequences are. It is central to any assisted browsing system.

For those who don't know, the Pandora database uses real people to create a
"genome" for each song; a "genome" is essentially a set of cognitive
attribute tags drawn from hundreds (possibly now thousands) of different
possible attributes. This is necessary partly because music has to be
listened to somehow to discover its cognitive content, and computers aren't
too good at that yet.

In our case the job is made easier by the fact that the EIS entries are
essentially just plain text, and the text itself is about the right size to
use as the genome.

So I have chosen a *character-level concordance metric* as the initial
algorithm for measuring similarity of sequences. My specific implementation
works as follows:
   Go through the first source text, compiling a list of all 5-byte
substrings that appear and their frequencies. Go through the second source
text and compile the same statistics for it. Thus for each string we have a
count for A and another count for B.
   For each substring we then redistribute the counts into an "AB count". AB
equals the lesser of A and B, and A and B are then both decreased by this
amount. For example, if "facto" appears three times in source text A and two
times in source text B, then the initial scan produces A{"facto"}=3 and
B{"facto"}=2; this is then adjusted to produce A{"facto"}=1, AB{"facto"}=2
and B{"facto"}=0.
   Then we compute the *"coverage*" of A over B, which measures how much
source A "covers" source B. This is simply the total of all AB counts
divided by the sum of AB's and B's. So for example, if source B contains
1004 bytes (and thus 1000 5-character strings), and if 700 of these
5-character strings occur in A, then A's coverage of B is 0.7 (or 70%). B's
coverage of A is computed similarly.
   The "*concordance*" between A and B is the lesser of the two coverage
values.

*Channel State*

The user's browsing is centered around "channels", which represent distinct
areas of interest. (We need another name for this, I'm just using "channel"
to help readers understand).
   Each channel has a datafile that contains its current state. This state
consists of:
   - A list of every sequence in the OEIS, with a score telling how well
that sequence fits this channel
   - A list of every sequence that has been recommended to the user so far
   - A flag for each recommendation telling whether it was given "thumbs up"
or "thumbs down" or neither.

*User Interface
*
The user begins by creating a "new channel" representing an area of
research. To create a channel the user must select a single relevant
sequence. This is the "seed sequence".

The program computes the concordance match between the seed sequence and all
other sequences, and stores this information in a file representing the
current state of that channel.

Then the program selects the sequence that has the highest concordance with
the seed sequence, and presents it to the user. (In MacOS, I can open the
sequence's web page into a broswer).

The user can then perform the following actions:
  - Next: give me another automatic recommendation
  - Thumbs up: this sequence belongs on this channel
  - Thumbs down: this sequence does not belong on this channel
  - Add another "seed sequence".

When the user selects Next, another sequence is recommended and no other
actions are taken.

When the user selects Thumbs Up or Thumbs Down, the software performs the
concordance metric between the new sequence and all other sequences in the
database. The resulting concordance score is either added to or subtracted
from the current scores for each sequence, and the channel state is updated.

The action of adding another seed sequence is treated the same way as
"Thumbs up" except using the user's choice rather than the
auto-recommendation as the sequence to compute metrics from.

I haven't finished this yet so I haven't found out whether the
recommendations will actually be any good (-:
*
Improvements*

Once I get this working, I'll consider improvements to the metric. At first
glance, it seems like it would be a good idea to give certain lines (like %Y
and %E) less importance and others (like %S and %T) more. This can be done
pretty easily within my current concordance implementation.

It is relatively easy to "undo" a Thumbs Up or Thumbs Down rating, so I'll
probably implement that next.

When new OEIS entries appear on the server, the channel needs to be updated.
If the updated sequence was not part of the channel's definition (i.e. was
never given thumbs up or thumbs down) then all we need to do is update that
sequence's score (the sum of its concordance against all thumbs-up items
minus its concordance against all thumbs-down items). If the updated
sequence IS one of the defining sequences for the channel, then the process
takes longer: First "undo" the sequence's rating; then "re-rate" the
sequence using the new text.

From: "N. J. A. Sloane" <njas at research.att.com>
>
> Rick - that's a very interesting suggestion!   I will pass it on the
> associate editors, bacause not all of them are on this list.
>
__

 From: Jonathan Post <jvospost3 at gmail.com>
>
>  Extremely interesting suggestion, with many subtle consequences!
>
__

From: Rick Shepherd <rlshepherd2 at gmail.com>
> Subject: [seqfan] Guided browsing of the OEIS based upon personal
>       preferences?
>
> Bear with me, I'm not really being off-topic (although this may be
> interesting to some people strictly for music-related reasons).
> [...]
> Today I ran across an article describing a system called Pandora for
> suggesting songs one may like based upon one's previous statements of
> preferred songs. [...]
> Here's the link, "The Song Decoders":
> http://www.nytimes.com/2009/10/18/magazine/18Pandora-t.html
> [...]
>

-- 
 Robert Munafo  --  mrob.com



More information about the SeqFan mailing list