searching for duplicates
Mitchell Harris
harris at tcs.inf.tu-dresden.de
Tue May 31 10:19:27 CEST 2005
It is very easy to check for possible -exact- duplicates of sequences in
the OEIS. The file "stripped.gz" at
http://www.research.att.com/~njas/sequences/stripped.gz
is already in lex order by sequence terms (not by A-number).
All you need now is to decompress it and run the unix shell command
uniq -D -f 1
which outputs every line whose content after the first (A-number) field
is repeated on an adjacent line. This captures sequences that have been
mistakenly submitted twice, as well as sequences that agree on all terms
stored in the database but diverge later.
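
For anyone without GNU uniq at hand, the same exact-duplicate check can be
sketched in Python. This is only an illustration, not part of the original
recipe: the function name exact_duplicates is made up, and it assumes each
line is an A-number, whitespace, then the comma-separated terms, with any
lines starting with "#" treated as comments. Unlike uniq, it does not rely
on the input being sorted.

```python
from collections import Counter


def exact_duplicates(lines):
    """Return the A-numbers whose term string occurs more than once.

    Hypothetical helper; assumes lines look like 'A000001 ,1,2,3,'.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment marker
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            rows.append((parts[0], parts[1].strip()))
    # Count how often each term string appears, ignoring the A-number.
    counts = Counter(terms for _, terms in rows)
    return [anum for anum, terms in rows if counts[terms] > 1]


print(exact_duplicates(["A1 ,1,2,", "A2 ,1,3,", "A3 ,1,3,"]))
```

To run it on the real file, the lines could be read with
gzip.open("stripped.gz", "rt") from the standard library.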
But is there an easy way to capture pairs where one sequence is a
-prefix- of another? That is, take a file like
A00000a ,1,2,2,
A00000b ,1,2,3,
A00000c ,1,2,3,4,
A00000d ,1,2,3,4,
A00000e ,1,2,3,5,
A00000f ,1,2,4,5,
So I'd want the middle 4 lines (A00000b through A00000e) to be output.
Yes, it is somewhat problematic that A00000d and A00000e are definitely
not duplicates of each other, but I want to know that A00000b is a
possible duplicate of each of them.
Also, it could be a problem that a short sequence like A00000b matches
an unwieldy number of sequences. For that, all I can do is hope that it
doesn't occur that often.
The purpose is to catch entries that were computed/derived from different
formulas/sources but may turn out to be equivalent.
I'm drawing a blank as to how to do this prefix check ... easily. I
suppose an awk/perl/whatever script would do but I guess I'm hoping for a
small handful of commands with serendipitous flags (that I don't have to
debug too much). Any ideas?
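
One possible approach, sketched here in Python rather than as a shell
one-liner: sort the entries by their term string, then keep a stack of
entries that are still "open" as prefixes. In sorted order every entry that
extends a given prefix follows it contiguously, so a single pass is enough.
The function name prefix_candidates and the sample A-numbers are
illustrative only, and the sketch assumes the same line layout as the
example file above (A-number, whitespace, comma-delimited terms), with
lines starting with "#" skipped as comments.

```python
def prefix_candidates(lines):
    """Return A-numbers whose terms are a prefix of, or equal to, another entry's.

    Hypothetical helper; results are listed in sorted order of the term strings.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment marker
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        # Normalise to 'a,b,c,' so that a string prefix is also a term prefix
        # (the trailing comma stops '1,2,3' from matching '1,2,33').
        key = parts[1].strip().strip(",") + ","
        entries.append((key, parts[0]))
    entries.sort(key=lambda e: e[0])  # stable sort by term string

    flagged = set()
    stack = []  # each stacked entry is a prefix of the one above it
    for key, anum in entries:
        # Drop stacked entries that are not prefixes of the current one.
        while stack and not key.startswith(stack[-1][0]):
            stack.pop()
        if stack:
            # The nearest remaining entry is a prefix of (or equal to) this one.
            flagged.add(stack[-1][1])
            flagged.add(anum)
        stack.append((key, anum))
    return [anum for _, anum in entries if anum in flagged]


SAMPLE = [
    "A00000a ,1,2,2,",
    "A00000b ,1,2,3,",
    "A00000c ,1,2,3,4,",
    "A00000d ,1,2,3,4,",
    "A00000e ,1,2,3,5,",
    "A00000f ,1,2,4,5,",
]
print(prefix_candidates(SAMPLE))
# -> ['A00000b', 'A00000c', 'A00000d', 'A00000e']
```

This reports which entries take part in some prefix relation, not which
pairs match, so a short sequence that matches many longer ones is listed
only once.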
--
Mitch