[seqfan] Extended search in OEIS data with GCD reduction

Tue Oct 13 18:31:11 CEST 2020

Dear SeqFans,

I would like to present a more general contribution here.

For a long time I have been thinking about how I can fix a fundamental
weakness in the search function and even in Superseeker for my own use. It
should be known that sequences that are only stored with a common factor
are not found when searching for a sequence in which this factor has been
divided out, or that uses a common factor other than that of the stored
data.

If we assume a hypothetical database entry 30, 70, 80, 100, 140, 160, 240,
then a search for 3,7,8,10,14,16,24 performed with either the normal OEIS
search function or with the Superseeker will not lead to a hit, although it
might be useful or at least interesting to get a reference to 30, 70, 80,
100, 140, ... in this case.

A practical example happened recently when Don Knuth submitted a new
sequence A337274 <https://oeis.org/A337274>. Only a little later did he
find out that there was already an existing sequence A033472
<https://oeis.org/A033472> on this topic, which, with the exception of the
initial terms, was exactly Don's sequence times 2.

So I thought it might be useful if a search could access data in which
common factors were divided out. Because in very many cases there are small
initial terms that make it impossible to extract factors, I decided to
start such a search for factors only with sufficiently large terms, or to
run the search backwards from the end of the sequence. In many cases, it
turns out that even smaller terms have the same factor that is found when
the search e.g. is only carried out for numbers> = 100. The method is
hardly applicable to "tabl" sequences, but structured search in tables is
another huge topic to be adressed.

I tried such a search for common factors with a range 100 <= x <= 2^63,
over the entire data fields of all sequences, and when I found a GCD > 1,
then in a second run the contiguous range was determined in which all
successive numbers were divisible by the factor. If at least 4 consecutive
numbers were found, then the result was output to a file. This file with
more than 50,000 lines (one line per A number) is welcome to look at. It is
saved under

http://www.randomwalk.de/sequences/gcd0.txt

The format of these files is one line per A-number, starting with the
A-number without the "A", then the GCD, the number of divided items and
then the list of data, which is potentially shorter than the original DATA,
due to excluded non-divisible initial terms and very large terms in the
tail of the DATA.

To my support, Georg Fischer carried out a similar search on the basis of
all b-files and, with a slightly different method, and came up with more
than 45,000 hits from sequences with GCD > 1.

Since the situation occurs quite often that sequences have the form
"composite number + 1" or "composite number - 1", I applied the same action
to a total data set that was incremented or decremented by 1.
This means that the prime number sequence A000040 in both variants (p +
1)/2 and (p-1)/2 becomes part of the result lists. In both cases, i.e., the
entire database shifted by plus 1 or minus 1, approx. 50,000 data records
are found, from which a common factor can be divided out. Because I
excluded sequences in which more than one negative number occurs, the
number of sequences that can be reduced by a common factor is probably even
slightly larger.

At the moment, the 3 files are only updated sporadically and not regularly.
The "shifted" results are given in
http://www.randomwalk.de/sequences/gcdminus1.txt
http://www.randomwalk.de/sequences/gcdplus1.txt

As long as I only use it myself, I can generate a current version from
"stripped" at any time. The extraction program for generating the reduced
sequences is a Fortran program that is still in a sketchy state, and there
is still a lot that can be cleaned up towards the production version, but
it works.
In addition to the three files named gcd * .txt, it currently generates a
number of other outputs that have helped me understand the program
function, among them an information file
http://www.randomwalk.de/sequences/stripped_stats.txt
which contains the number of commas (# of data +1) and the number of minus
signs per data line and thus gives a quick overview of how many searchable
terms (excluding b-files) there are for each database entry and whether it
is a "sign" entry.

You can search in the result files with a text editor or with grep.

But my real goal was to use this in an advanced search by program. If you
are looking for a database with GCD that has been divided out, then you
should of course also eliminate common factors from the number series of
the search input.

Since the programming language in which I can create a working program with
access to files and some calculation with the least effort and safely is
still the good old Fortran, I implemented the search function and the
conversion program in this language.
The search function is based on the files mentioned. The source code of the
programs is available in a zip file
http://www.randomwalk.de/sequences/oeisearch.zip

The program is commented so far that it should be easy to implement it
yourself in any other programming language.

The GCD calculation is performed with the subroutine gcdn, which implements
the ACM algorithm 386 by T. Bradley from 1970.
https://dl.acm.org/doi/10.1145/362686.362697
A version that has been trivially adapted to 8-byte integer is at the end
of my Fortran file. The original can be obtained from
http://netlib.org/toms/386.gz

On Linux systems this can be compiled with the Gnu-Fortran compiler
gfortran.

For Windows users I provide a command line executable oeisearch (see link
to readme below how to access and install).

For those who are interested in trying this out, the necessary information
for installation and examples of its use are given in

http://www.randomwalk.de/sequences/oeisearch_readme.txt

I can assure you that it's fun, and I've already made some amazing
discoveries with recently submitted new sequences, often just converting
the input variants (+ -1, GCD) to hits in the original OEIS data before the
GCD-reduced data is accessed at all.

If I can dream a little, then a combination of Superseeker with the
GCD-reduced data would be a significant extension of the hit area.
Superseeker users already know the references to near hits, and then an
extended reference, like
"transformation Txx produces a match in (Axxxxxx (n) -1) / 24" would also
be interpretable.

I look forward to any kind of feedback and hope for a lively discussion.

Best regards

Hugo Pfoertner