[seqfan] An idea for automatic clustering of OEIS entries based on their b-file data.

Sun May 20 16:05:25 CEST 2018

Dear SeqFans,

Last month there was discussion on this list about improving the
classification system of OEIS:
http://list.seqfan.eu/pipermail/seqfan/2018-April/018547.html
Neil's conclusion was that we should incrementally improve what we already
have, the OEIS Index entries.

I also agree with Mark Dion: "all taxonomies are flawed systems,
products of subjective attempts to impose objective order on the
elemental chaos of nature", as quoted by Neil, especially when one
stresses the word _subjective_. In that respect, even an automatic
classification system that would be mostly based on various textual
data fields of each entry (as most systems suggested over the
years have been) would just entrench our current biases about what each
sequence is about.
E.g., all the journal references might be about applications in
a particular subfield of combinatorics, while the sequence might really
have its most dramatic import in number theory, only nobody knows
that yet.

What I suggest now is quite different:

Without making any assumptions about sequences, an algorithm ("AI") should
cluster the OEIS entries based purely on their terms (b-file data).
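
As a starting point, reading the terms out of a b-file could be sketched
as follows in Python. This assumes the usual plain-text b-file layout
(one "n a(n)" pair per line, with "#" starting a comment); the function
name parse_bfile and the sample text are only illustrative.

```python
def parse_bfile(text):
    """Parse b-file text into a list of (n, a(n)) integer pairs."""
    terms = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines rather than guessing
        terms.append((int(parts[0]), int(parts[1])))
    return terms


# Hand-typed sample in b-file style (first terms of A000010).
sample = """# A000010, first few terms
1 1
2 1
3 2
4 2
5 4
"""
print(parse_bfile(sample))  # [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)]
```

Python integers are unbounded, so very large terms survive parsing
intact; any later floating-point processing has to deal with them
separately.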

At first I thought that something based on just the visual resemblance
of the scatter plots to each other would work, especially after I had
discovered a few further examples of divisor-or-multiple permutations
just by their "looks" (another topic of last month, continued in the
next mail). But this supposes that the scatter plots of sequences have
conspicuous features that are easy to compare.

Unfortunately, most number-theoretic sequences look very much alike,
at least to the human eye. At best there are emanating "rays":

https://oeis.org/A000010/graph
https://oeis.org/A000203/graph
https://oeis.org/A050216/graph
https://oeis.org/A280864/graph

with some variance in the density of rays towards each direction.

(Or their looks might be even more nondescript, as with A000005,
A001221 and A001222.)

Of course, there ARE patterns hidden in these sequences, e.g. in the
famous EKG-sequence, but they are just not easily spotted with the usual
plotting options; e.g., in the scatter plot
https://oeis.org/A064413/graph they cannot be seen at all "at that
zoom-level".

On the other hand, when we consider the base-sequences, whatever the
base, including bases like Fibonacci or factorial base, the OEIS
is full of sequences with strikingly fractal scatter plots, see e.g.:
https://oeis.org/A007953/graph or https://oeis.org/A289234/graph or
https://oeis.org/A276445/graph

So I started wondering: why are the plots of "base-sequences" (in the
most general sense) so much more conspicuous than the plots of
"number-theoretical sequences", i.e., sequences involving divisibility
and the prime signature? But... maybe I was trying to apply the wrong
sensory apparatus to the latter ones? Maybe they are not best savoured
with the
_eyes_, but with the _ears_? That is, apply Fourier analysis, and
other sound-related signal-processing tricks? And maybe our eyes have
learned to recognize fractal patterns so easily because so much of our
natural environment is fractal?

Thus I reckon, the algorithm (the "AI") should have both "visual" and
"auditory" tools ("2D" and "1D" signal processing) at its disposal.
Of course, internally it will just manipulate b-file data in some way
or another.
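
To make the "1D" and "2D" views a bit more concrete, here is a minimal
Python sketch of two possible feature extractors: a small magnitude
spectrum (the "auditory" view) and a coarse occupancy grid of the
scatter plot (the "visual" view). The function names, the bin counts
and the normalization are all assumptions made for illustration, and
the terms are assumed to be small enough for floating point (for
fast-growing sequences one would probably take logarithms first).

```python
import cmath
import math


def spectrum(values, n_bins=8):
    """Magnitudes of DFT frequencies 1..n_bins of the mean-centered,
    unit-norm version of the values (naive O(n * n_bins) transform)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered)) or 1.0
    x = [c / norm for c in centered]
    mags = []
    for k in range(1, n_bins + 1):
        s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
        mags.append(abs(s))
    return mags


def occupancy_grid(values, size=4):
    """Count scatter-plot points (index, value) falling into each cell
    of a size x size grid, flattened row by row (low values first)."""
    n = len(values)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    grid = [0] * (size * size)
    for i, v in enumerate(values):
        col = min(size - 1, i * size // n)
        row = min(size - 1, (v - lo) * size // span)
        grid[row * size + col] += 1
    return grid


# A period-2 toy input: nearly all spectral energy lands in bin k = n/2.
toy = [0, 1] * 8
print(spectrum(toy, n_bins=8))
print(occupancy_grid(toy, size=2))
```

Either vector (or both concatenated) could then serve as the input to
whatever clustering method is chosen.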

Now, many of the most interesting sequences in OEIS have been
generated by a greedy algorithm (although later a more direct formula
is sometimes found). Most of these live in their "own corners",
meaning that it is unlikely that any exact one-to-one transform
applied to them would yield any hits outside of that limited
sector. Also, insofar as those sequences are injective (as any
aspiring greedy permutation is, by definition), the other technique
I have been developing is not of much help here (the "OEIS Djinn"
project, based on partitioning a noninjective sequence a into
distinct equivalence classes, to find cases where b(n) = f(a(n))
apparently holds, for any other sequence b that is present in OEIS
and any N -> Z function f, even unknown ones not present in OEIS).
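
The equivalence-class idea can be illustrated with a small Python
sketch (this is not the actual Djinn code, just a toy check under my
own naming): on the known terms, some f with b(n) = f(a(n)) can exist
exactly when a(i) = a(j) always implies b(i) = b(j), and this can be
tested without knowing f at all.

```python
def is_function_of(a, b):
    """Return True if, on the given terms, equal values of a always
    go with equal values of b, i.e. b(n) = f(a(n)) is possible."""
    seen = {}  # value of a -> the value of b it must map to
    for x, y in zip(a, b):
        if seen.setdefault(x, y) != y:
            return False
    return True


# Number of divisors of n, and the prime indicator, for n = 1..12.
ndiv = [1, 2, 2, 3, 2, 4, 2, 4, 3, 4, 2, 6]
isprime = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

print(is_function_of(ndiv, isprime))  # True: n is prime iff it has 2 divisors
print(is_function_of(isprime, ndiv))  # False: nonprimes 1 and 4 differ in ndiv
```

On finite data a True answer is of course only an "apparently holds",
and for an injective a the check is trivially True for every b, which
is why it says nothing useful about permutations.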

In this respect, a clustering algorithm based on "fuzzy" similarities
(whether "visuo-spatial" or "auditory-linear") between the data in
their b-files might find many resemblances even between different variants
of greedy sequences. For example, many Recaman variants look quite
similar, as do most of the ordinal transforms, and most of the random
walks have an easily recognizable "feel" to them, even when the actual
paths they follow are far from each other and when the deltas are not
all just +-1 steps.
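
As a toy illustration of such "fuzzy" clustering, the Python sketch
below describes each sequence by a deliberately crude feature (the
fractions of rising, falling and flat steps), compares features by
cosine similarity, and links everything above a threshold
(single-linkage). The feature, the threshold 0.9 and all names are
assumptions chosen only to keep the example short; a real system would
need far richer features.

```python
import math


def step_profile(values):
    """Toy feature: fractions of rising, falling and flat steps.
    Assumes at least two terms."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    n = len(deltas)
    up = sum(d > 0 for d in deltas) / n
    down = sum(d < 0 for d in deltas) / n
    return [up, down, 1.0 - up - down]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(features, threshold=0.9):
    """Single-linkage clustering via union-find: two names end up in
    the same group if a chain of pairs with similarity >= threshold
    connects them."""
    names = list(features)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if cosine(features[p], features[q]) >= threshold:
                parent[find(p)] = find(q)
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Made-up toy sequences, not OEIS data.
seqs = {
    "zigzag_a": [0, 1, 0, 1, 0, 1, 0],
    "zigzag_b": [10, 3, 12, 1, 15, 0, 20],
    "rising": [1, 2, 3, 4, 5, 6, 7],
}
print(cluster({k: step_profile(v) for k, v in seqs.items()}))
# [['rising'], ['zigzag_a', 'zigzag_b']]
```

Note that the two zigzags are grouped together although their actual
values are far apart, which is the kind of resemblance meant above.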

Of course the system would NOT need to assign any human category to
the plots based on their visual features. E.g., it really doesn't
matter whether the system "knows" that a particular scatter plot
resembles Sagrada Familia (A002487) or Petronas towers (A262097), as
long as those sequences are clustered reasonably near each other.
In this case that would be meaningful, as those two sequences are
not visually near each other by coincidence, but are actually based on
a similar mathematical idea. Also, the algorithm wouldn't care at all
whether we humans have already realized that some sequence is a
Recaman-variant or an ordinal transform.

So, I think there is low-hanging fruit here for anybody who has some
interest and experience in pattern recognition / clustering
algorithms. For starters, one wouldn't need to implement anything on
the OEIS server side, but could do the analysis remotely, based on
b-files, and then share the results with us on a third-party website
that would preferably offer an easily navigable tree/graph structure.

And yes, I still haven't thought about whether such techniques are any
good for the great majority of monotonic sequences, "the counting
sequences" loved by combinatorialists. But that is also the field with
the most established techniques, all the well-known 1-to-1 transforms.
And if we had a deep-learning AI, it could soon combine all the
transforms, both the existing ones and the new ones it has concocted by
itself, to find some really hairy connections. But this is a topic for
another discussion: how could such a program avoid making lots of
trivial conjectures?  Also, it would be nice if we could instruct a
program to find unexpected things like the recent Lemke
Oliver-Soundararajan bias, where something contrary to the existing
beliefs would be found from
the b-file data. But this supposes that we could somehow formulate
(e.g. with an ontological scheme) "mathematical prejudices" about
how things are. Or maybe there should be a "stupid" AI inventing such
prejudices and a "clever" AI refuting them? But I digress again.

In any case, we need more b-files, many more (for the existing
sequences first).

Best regards,

Antti Karttunen
PS. If you reply, then please include me on the CC: line; otherwise I will
not see your reply until the next digest arrives, as the web-portal of
this list seems to have missed most of the recent postings.