searching for duplicates

Tue May 31 12:37:27 CEST 2005

Mitchell Harris <harris at tcs.inf.tu-dresden.de> wrote:
:
:It is very easy to check for possible -exact- duplicates of sequences in
:the OEIS. The file "stripped.gz" at
[...]
:But, is there an easy way to capture those such that one is the -prefix- 
:of another? That is, take a file like
:
:A00000a ,1,2,2,
:A00000b ,1,2,3,
:A00000c ,1,2,3,4,
:A00000d ,1,2,3,4,
:A00000e ,1,2,3,5,
:A00000f ,1,2,4,5,
:
:So I'd want the middle 4 lines to be output. Yes, it is somewhat 
:problematic that A00000d and A00000e are definitely not duplicates, but I 
:want to know that A00000b is a possible duplicate of them. 

Here's one pretty simple way to do it:
  zcat stripped.gz | perl -nawle 'BEGIN{$p="x"} if ($F[1] =~ /^\Q$p\E/) { print "$n is a prefix of $F[0]" } else { ($n,$p) = @F[0,1] }'

.. which shows me a total of 1703 lines of output with latest stripped.gz,
starting:
A072401 is a prefix of A064873
A051069 is a prefix of A051065
A015582 is a prefix of A100910
...

By keeping one sequence as the $p(revious) until something fails to match
it as a prefix, this ensures the shortest prefix will always be shown,
but won't tell you if there is also a longer prefix. That is, given:
  A00000b ,1,2,3,
  A00000c ,1,2,3,4,
  A00000d ,1,2,3,4,5,
it will tell you that:
  A00000b is a prefix of A00000c
  A00000b is a prefix of A00000d
.. but not that A00000c is also a prefix of A00000d.

:I'm drawing a blank as to how to do this prefix check ... easily. I
:suppose an awk/perl/whatever script would do but I guess I'm hoping for a
:small handful of commands with serendipitous flags (that I don't have to
:debug too much). Any ideas?

I hope me debugging it for you is good enough. :) I'm not aware of
any way to do this just with flags using standard utilities.

Hugo