searching for duplicates

Mitchell Harris harris at tcs.inf.tu-dresden.de
Tue May 31 10:19:27 CEST 2005


It is very easy to check for possible -exact- duplicates of sequences in
the OEIS. The file "stripped.gz" at

  http://www.research.att.com/~njas/sequences/stripped.gz

is already sorted lexicographically by sequence terms (not by Anum).
All you need to do is pipe the uncompressed file through the Unix shell
command

  uniq -D -f 1

which prints every line whose text after the first (Anum) field is
identical to that of an adjacent line. This catches sequences that were
mistakenly submitted twice, as well as sequences that agree on all the
terms stored in the database but diverge later.
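
As a concrete sketch of that step: the function name exact_dups and the
local file name stripped.gz are placeholders of mine, and -D is a GNU uniq
option rather than POSIX, so this assumes GNU coreutils. Since uniq only
compares adjacent lines, it also relies on the file really being sorted by
sequence.

```shell
# Hypothetical helper: reads "Anum ,t1,t2,...," lines on stdin and prints
# every line whose text after the first field repeats on an adjacent line.
exact_dups() {
  uniq -D -f 1
}

# Possible usage, assuming stripped.gz was downloaded to the current directory:
#   gzip -dc stripped.gz | exact_dups
```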

But is there an easy way to catch pairs of sequences where one is a
-prefix- of the other? For example, take a file like

A00000a ,1,2,2,
A00000b ,1,2,3,
A00000c ,1,2,3,4,
A00000d ,1,2,3,4,
A00000e ,1,2,3,5,
A00000f ,1,2,4,5,

Here I'd want the middle four lines (A00000b through A00000e) to be output.
Yes, it is somewhat awkward that A00000d and A00000e are definitely not
duplicates of each other, but I want to know that A00000b is a possible
duplicate of each of them.

Another possible problem is that a short sequence like A00000b could match
an unwieldy number of longer sequences. For that, all I can do is hope that
it doesn't happen too often.

The purpose is to catch entries that were computed/derived from different 
formulas/sources but may turn out to be equivalent. 

I'm drawing a blank as to how to do this prefix check ... easily. I
suppose an awk/perl/whatever script would do but I guess I'm hoping for a
small handful of commands with serendipitous flags (that I don't have to
debug too much). Any ideas?
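
One possible sketch, not a tested recipe for the real file: in the example
above every sequence field starts and ends with a comma, so "is a prefix
of" is just a string-prefix test, and in byte order a prefix sorts ahead of
the contiguous block of its extensions. One sort plus a small awk stack
then seems to be enough. The name prefix_dups is made up here; the re-sort
in the C locale and the skipping of '#' comment lines are my assumptions
about the file, not facts from it. It also holds all lines in memory.

```shell
# Hypothetical helper: reads "Anum ,t1,t2,...," lines on stdin and prints
# (in sequence order) each line whose sequence field is a prefix of, or an
# extension of, another line's sequence field. Exact duplicates count too.
prefix_dups() {
  # Re-sort bytewise on the sequence field so that a prefix always comes
  # before its extensions (assumption: input may not be in C-locale order).
  LC_ALL=C sort -k2,2 |
  awk '
    /^#/ || NF < 2 { next }          # assumed: skip comments and odd lines
    {
      n++; line[n] = $0; seq = $2
      # Pop stacked sequences that are not a prefix of the current one.
      while (top > 0 && index(seq, stack[top]) != 1) top--
      # Whatever remains on top is a prefix of seq: flag both lines.
      if (top > 0) { hit[pos[top]] = 1; hit[n] = 1 }
      top++; stack[top] = seq; pos[top] = n
    }
    END { for (i = 1; i <= n; i++) if (hit[i]) print line[i] }
  '
}

# Possible usage, assuming stripped.gz is in the current directory:
#   gzip -dc stripped.gz | prefix_dups
```

On the six-line example file this prints the lines for A00000b through
A00000e and drops A00000a and A00000f.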

-- 
Mitch









