Data Mining EIS

karttu at megabaud.fi
Thu Jul 27 02:53:29 CEST 2000


> I agree something like this would be nice.

Yes, but please let's not burden Neil with this.
Instead, everybody can use a script like this, called geteis.csh:

#!/bin/csh
# Fetch every published EIS page and append its sequences to the output file.
if ("a" == "$1a") then
 echo "Usage: $0 outputfilename (takes more than a dozen megabytes)"
 exit 1
endif
set a=0
luuppi:
# Page numbers are zero-padded to two digits: 00, 01, 02, ...
if ($a < 10) set a=0$a
echo "Dumping http://www.research.att.com/~njas/sequences/eisBTfry000$a.html"
lynx -dump http://www.research.att.com/~njas/sequences/eisBTfry000$a.html > eis$a.$$
# The first missing page marks the end of the published portion.
if (`fgrep -c 'Document Not Found' eis$a.$$` > 0) then
 echo "/~njas/sequences/eisBTfry000$a.html not found anymore, finishing."
 rm -f eis$a.$$
 exit 0
endif
perl getseqs.pl < eis$a.$$ >> $1
rm -f eis$a.$$
set a=`expr $a + 1`
goto luuppi


which in turn feeds each fetched batch of sequences to the Perl script getseqs.pl:

#!/usr/bin/perl
#
# Usage:
# cat a*.html | perl getseqs.pl
#
# The lines that interest us have the following format:
#
# %S A034825 0,1,1,2,4,9,20,48,115,286,718,1832,4702,12159,31515,81888,
# %T A034825 212878,553557,1438741,3737331,9700188,25156049,65181067,
# %U A034825 168746672,436505846,1128256918,2914103577,7521450053,19400577711
# %N A034825 n-node rooted trees of height at most 8.
# %A A034825 njas
#
# Each such group is printed as one line of the form
# A034825 := [terms]; # name A: author
#
sub scan_input
{
    my($file) = @_;
    my($seqnum,$rest,$list_ended);
    $seqnum = "";
    $list_ended = 1;

    while($_ = <$file>)
     {
       if(/^\%S ([^ ]*) /)
        {
          # A new sequence starts; close the previous list if it is still open.
          $seqnum = $1;
          $rest = $';
          chomp($rest);
          if(!$list_ended) { print "];"; $list_ended = 1; }
          print "\n$seqnum := [$rest";
          $list_ended = 0;
        }
       elsif(("" ne $seqnum) && /^\%[TU] ([^ ]*) /)
        {
          # %T and %U lines continue the term list of the current sequence.
          $rest = $';
          chomp($rest);
          if($1 ne $seqnum)
           {
             # Continuation line of another sequence: close the list and
             # ignore everything until the next %S line.
             warn("Sequence number changed incoherently from $seqnum to $1, the rest is ignored: $rest\n");
             if(!$list_ended) { print "];"; $list_ended = 1; }
             print " # Sequence number changed incoherently from $seqnum to $1, the rest is ignored: $rest";
             $seqnum = "";
           }
          else
           {
             print "$rest";
           }
        }
       elsif(/^\%N ([^ ]*) / && ($1 eq $seqnum))
        {
          # The name line closes the list and becomes a trailing comment.
          $rest = $';
          chomp($rest);
          print "]; # $rest";
          $list_ended = 1;
        }
       elsif(/^\%A ([^ ]*) / && ($1 eq $seqnum))
        {
          # The author line is appended to the comment.
          $rest = $';
          chomp($rest);
          if(!$list_ended) { print "]; #"; $list_ended = 1; }
          print " A: $rest";
        }
     }
    # Close a list that is still open when the input ends.
    if(!$list_ended) { print "];"; }
    print "\n";
}

&scan_input(*STDIN);
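
For those who would rather not edit Perl, the same reformatting can be
sketched in Python. This is only a simplified restatement, not a port:
the function name reformat and its term_tags parameter are my own
invention, it collects the records in memory instead of printing them
as it goes, and it silently skips lines with a mismatching sequence
number instead of warning about them. Sequences that have no line with
the first tag produce no output at all.

```python
import re

# A tagged line: "%S A034825 0,1,1,2," -> tag "S", number "A034825", rest
TAGGED = re.compile(r"^%(\w) (\S+) ?(.*)$")


def reformat(lines, term_tags=("S", "T", "U")):
    """Return one 'Annnnnn := [terms]; # name A: author' string per sequence.

    term_tags gives the tag that starts a term list followed by the tags
    of its continuation lines.
    """
    first, continuation = term_tags[0], term_tags[1:]
    records = []  # each record is [seqnum, terms, name, authors]
    for line in lines:
        m = TAGGED.match(line.rstrip("\n"))
        if m is None:
            continue
        tag, seqnum, rest = m.groups()
        if tag == first:
            records.append([seqnum, rest, "", []])
        elif not records or records[-1][0] != seqnum:
            continue  # the line belongs to no open record
        elif tag in continuation:
            records[-1][1] += rest
        elif tag == "N":
            records[-1][2] = rest
        elif tag == "A":
            records[-1][3].append(rest)
    return ["%s := [%s]; # %s A: %s" % (s, t, n, ", ".join(a))
            for s, t, n, a in records]
```

Calling reformat(open("somepage.txt")) gives the unsigned terms, and
reformat(open("somepage.txt"), ("V", "W", "X")) takes the %V, %W and %X
lines instead (somepage.txt being any dumped page).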


These two scripts
(also available at the URLs
http://www.megabaud.fi/~karttu/matikka/geteis.csh
and
http://www.megabaud.fi/~karttu/matikka/getseqs.pl
), when started e.g. as ./geteis.csh eisjul27.txt,
will fetch the whole currently published portion of the EIS
(provided that nobody has added the string "Document Not Found"
to the Comment lines...) and format it like this:

A000004 := [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]; # The zero sequence. A: njas
A023976 := [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]; # First bit in fractional part of binary expansion of 9-th root of n. A: njas, Olivier Gerard (ogerard at ext.jussieu.fr)
A025469 := [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,]; # Partitions of n into 3 distinct positive cubes. A: dww
...
A056061 := [1,1,1,1,1,2,1,1,2,4,1,2,2,2,2,2,1,2,1,2,2,2,1,2,4,4,8,8,4,6,2,2,2,4,4,4,4,2,4,2,2,2,2,8,12,4,8,8,8,8,8,4,6,2,2,2,3,2,3,3,3,4,4,2,4,2,4,44,1,2,2,2,4,4,8,12,2,4,12,12,4,4,8,12,12,12,4,6,8,12,12,12,8,16,8,8,]; # Number of square divisors of central binomial coefficients. A: Labos E. (labos"ana1.sote.hu), Jul 26 2000

(The whole output file was 14 megabytes.)
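
If you want to process lines of this format outside Maple, they are
easy to read back in. Here is a minimal sketch in Python; the function
name parse_line is just my choice, and it assumes the sequence numbers
are always "A" followed by six digits, as in the examples above.

```python
import re

# One output line:  A000004 := [0,0,0]; # The zero sequence. A: njas
LINE_RE = re.compile(r"^(A\d{6}) := \[([^\]]*)\];(?: # ?(.*))?$")


def parse_line(line):
    """Return (seqnum, terms, comment) for one formatted line, else None."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    seqnum, body, comment = m.groups()
    # Some lists end with a trailing comma ("...,0,]"), so skip empty fields.
    terms = [int(field) for field in body.split(",") if field.strip()]
    return seqnum, terms, comment or ""
```

The comment part is returned as it stands, with the name and the
" A: author" part still together.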


The output above is ready to be read into Maple, for example.
Now anybody can edit getseqs.pl above to their taste, e.g. take the
%V, %W and %X lines instead of %S, %T and %U in case the
sequence is signed, skip dead sequences, count the occurrences
of the integers 0 - 10000 and see which is the smallest term
not yet explicitly listed, etc.
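
The counting idea could look roughly like this when run on the
formatted output file rather than inside getseqs.pl. This is only a
sketch in Python; the names count_terms and smallest_missing are made
up for this example, and fields that are not plain integers are simply
skipped.

```python
import re
from collections import Counter


def count_terms(lines, limit=10000):
    """Count how often each integer 0..limit occurs as a listed term."""
    counts = Counter()
    for line in lines:
        m = re.search(r"\[([^\]]*)\]", line)  # the term list in brackets
        if m is None:
            continue
        for field in m.group(1).split(","):
            field = field.strip()
            if not re.fullmatch(r"-?\d+", field):
                continue  # empty field after a trailing comma, or garbage
            value = int(field)
            if 0 <= value <= limit:
                counts[value] += 1
    return counts


def smallest_missing(counts, limit=10000):
    """Return the smallest integer in 0..limit with no occurrence, or None."""
    for n in range(limit + 1):
        if counts.get(n, 0) == 0:
            return n
    return None
```

With the file from the example run this would be used as
counts = count_terms(open("eisjul27.txt")) followed by
smallest_missing(counts).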

In case your Unix platform doesn't have lynx installed, use wget,
which can be found at
http://www.gnu.org/software/wget/wget.html
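
If neither is at hand, the fetching loop of geteis.csh can also be
sketched with Python's standard urllib. The URL pattern and the
"Document Not Found" test are taken from the script above; the
function names page_url, http_get and fetch_pages are hypothetical.
Note that this fetches the raw HTML instead of a lynx text dump, so I
have not checked that getseqs.pl accepts the result unchanged.

```python
import urllib.error
import urllib.request

# Same URL pattern as in geteis.csh, counter zero-padded to two digits.
BASE = "http://www.research.att.com/~njas/sequences/eisBTfry000%02d.html"


def page_url(n):
    """URL of the n-th page (n = 0, 1, 2, ...)."""
    return BASE % n


def http_get(url):
    """Fetch one page as text; a 404 is mapped to the stop marker."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("latin-1")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return "Document Not Found"
        raise


def fetch_pages(fetch=http_get):
    """Yield the text of pages 0, 1, 2, ... until one is not found."""
    n = 0
    while True:
        text = fetch(page_url(n))
        if "Document Not Found" in text:
            return
        yield text
        n += 1
```

Each text yielded by fetch_pages() corresponds to one eis$a.$$ file of
the csh script. The fetch argument is there only so that the loop can
be tried out without touching the network.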


Terveisin,
  Yours,

Antti Karttunen