searching for duplicates

Mitchell Harris harris at tcs.inf.tu-dresden.de
Tue May 31 14:06:56 CEST 2005


On Tue, 31 May 2005 hv at crypt.org wrote:

>Here's one pretty simple way to do it:
>  zcat stripped.gz | perl -nawle 'BEGIN{$p="x"} if ($F[1] =~ /^\Q$p\E/) { print "$n is a prefix of $F[0]" } else { ($n,$p) = @F[0,1] }'
>
>.. which shows me a total of 1703 lines of output with latest stripped.gz,
>starting:
>A072401 is a prefix of A064873
>A051069 is a prefix of A051065
>A015582 is a prefix of A100910
>...

That did it! Excellent! Thanks!

>By keeping one sequence as the $p(revious) until something fails to match
>it as a prefix, this ensures the shortest prefix will always be shown,
>but won't tell you if there is also a longer prefix. That is, given:
>  A00000b ,1,2,3,
>  A00000c ,1,2,3,4,
>  A00000d ,1,2,3,4,5,
>it will tell you that:
>  A00000b is a prefix of A00000c
>  A00000b is a prefix of A00000d
>.. but not that A00000c is also a prefix of A00000d.

That's no problem at all. It'll print out the range, and any further 
nuances can be resolved by inspection.

>:I'm drawing a blank as to how to do this prefix check ... easily. I
>:suppose an awk/perl/whatever script would do but I guess I'm hoping for a
>:small handful of commands with serendipitous flags (that I don't have to
>:debug too much). Any ideas?
>
>I hope me debugging it for you is good enough. :) I'm not aware of
>any way to do this just with flags using standard utilities.

Thanks for doing my work for me. Perl is fine; I was just hesitant to 
figure that out myself.

-- 
Mitch





