Longest common subsequences: summary so far
N. J. A. Sloane
njas at research.att.com
Mon Jun 14 18:01:46 CEST 2004
Longest common subsequences: summary so far
The following is an edited record of the correspondence on this problem so far.
NJAS Jun 14 2004
***************************************************************************************
>From njas at research.att.com Fri Jun 11 23:40:54 2004
Date: Fri, 11 Jun 2004 23:40:53 -0400 (EDT)
From: "N. J. A. Sloane" <njas at research.att.com>
Message-Id: <200406120340.XAA13275 at fry.research.att.com>
To: math-fun at mailman.xmission.com, seqfan at ext.jussieu.fr
Subject: number of longest common subsequences problem
I ran into my old friend Al Aho today for
the first time in many years, and he mentioned
the following possibly unsolved problem.
Given two strings X and Y of length n from some fixed
alphabet, let f(X,Y) = number of longest common subsequences of X and Y.
What is a(n) = max f(X,Y) over all choices of X and Y?
Remarks:
S is a subsequence of X if S can be
obtained by deleting some of the entries of X
(not necessarily adjacent).
You can choose any alphabet you like.
Here is a definition of "longest common subsequence" from
http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml
(see also http://www.nist.gov/dads/HTML/longestCommonSubstring.html)
Given two strings X and Y,
what is the longest sequence S such that the characters of S appear
as a scattered subsequence of both X and Y?
Then the question is, how many different longest common subsequences
can occur?
There are two versions of the problem, depending on whether
we take into account the location of S inside X and Y (Version 1)
or not (Version 2).
NJAS
***************************************************************************************
>From hv at crypt.org Sat Jun 12 10:25:07 2004
Cc: math-fun at mailman.xmission.com, seqfan at ext.jussieu.fr
Subject: Re: number of longest common subsequences problem
Here you introduce the word "different" for the first time. Are two
subsequences "different" if the subsequences themselves differ, or is it
sufficient for the letters of the subsequence to be drawn from different
positions in X or Y? Ie given:
X = "aabaa", Y = "aaabb"
are there 1 or 4 occurrences of the LCS "aaa"?
Hugo van der Sanden
My reply:
There are 2 versions of the problem. In Version 1 the answer
would be 4, in Version 2 the answer would be 1.
For Version 1:
In your example, the LCS has length 3, and
there seem to be 10 ways to get 3:
X = aabaa Y = aaabb
---------------------
aab.. aa.b.
aab.. aa..b
aab.. a.ab.
aab.. a.a.b
aab.. .aab.
aab.. .aa.b
aa.a. aaa..
aa..a aaa..
a..aa aaa..
.a.aa aaa..
so a(5) >= 10.
NJAS
***************************************************************************************
>From Sterten at aol.com Sat Jun 12 11:15:42 2004
Date: Sat, 12 Jun 2004 11:15:06 EDT
Subject: re: number of longest common subsequences problem
let m be the size of the alphabet.
I get : (n=1,2,...)
m=2:1,2,2,4,4,6,8,11,15,20,
m=3:1,2,3,4,6,10,
m=4:1,2,3,4,7,
m=5:1,2,3,4,7,
I assume that the maximum is usually attained for
X=123..m123..m... , except for some small values of n.
fixing X this way gives:
m=2:1,2,2,3,4,6,7,10,14,20,26,36,51,70,96,141,
m=3:1,2,3,4,6,10,14,24,36,58,
m=4:1,2,3,4,7,10,19,28,51,
m=5:1,2,3,4,6,10,15,28,
m=6:1,2,3,4,6,9,14,
with hillclimbing my best values were:
m=2:1,2,2,4,4,6,8,11,15,20,26,36,50,70,96,141,192,
m=3:1,2,3,4,6,10,14,24,36,58,88,145,
m=4:1,2,3,4,7,10,19,28,51,78,141,220,
m=5:1,2,3,4,7,10,19,28,44,69,112,186,
my lower bounds for m=5 are obviously bad, or there could
even be a bug in my program.
I didn't find the above sequences in the OEIS
Guenter
Me: I added some of them:
%I A094858
%S A094858 1,2,2,4,4,6,8,11,15,20
%N A094858 Maximal number of longest common subsequences between any two binary strings of length n (Version 2).
%C A094858 Definitions: S is a subsequence of X if S can be obtained by deleting some (not necessarily adjacent) entries of X.
%C A094858 S is a longest common subsequence of X and Y if S is a subsequence of X, S is a subsequence of Y, and for any T, if T is a subsequence of X and of Y, then |T| <= |S|. Let LCS(X,Y) = length of any longest common subsequence of X and Y.
%C A094858 For each common subsequence S of X and Y, there may be several ways to delete entries from X and from Y to get S, but in this version of the problem we do not take this into account (cf. A094837). Let F(X,Y) be the number of different choices for S, without regard to where it appears in X and Y. Sequence gives max F(X,Y) over all choices for binary strings X and Y of length n.
%C A094858 For this version of the problem using a larger alphabet helps (cf. A094859, A094863).
%C A094858 For an alphabet of size m, the maximum appears to be attained for X=123..m123..m..., except for some small values of n.
%Y A094858 Cf. A094859-A094862, A094837, A094824.
%O A094858 1,2
%K A094858 nonn,nice,more
%A A094858 Guenter Stertenbrink (Sterten(AT)aol.com), Jun 14 2004
%C A094858 Hill-climbing gives the following lower bounds for the next few terms: 26,36,50,70,96,141,192.
%I A094859
%S A094859 1,2,3,4,6,10
%N A094858 Maximal number of longest common subsequences between any two ternary strings of length n (Version 2).
%C A094858 Same as A094858 but now we can use an alphabet of size 3.
%O A094859 1,2
%K A094859 nonn,more
%A A094859 Guenter Stertenbrink (Sterten(AT)aol.com), Jun 14 2004
%I A094860
%S A094860 1,2,2,3,4,6,7,10,14,20,26,36,51,70,96,141
%N A094860 Same as A094858, except that we fix X = 121212...12(1).
%O A094860 1,2
%K A094860 nonn,more
%A A094860 Guenter Stertenbrink (Sterten(AT)aol.com), Jun 14 2004
%I A094861
%S A094861 1,2,3,4,6,10,14,24,36,58
%N A094861 Same as A094858, except that we fix X = 123123123...
%O A094861 1,2
%K A094861 nonn,more
%A A094861 Guenter Stertenbrink (Sterten(AT)aol.com), Jun 14 2004
%I A094862
%S A094862 1,2,3,4,7,10,19,28,51
%N A094862 Same as A094858, except that we fix X = 123412341234...
%O A094862 1,2
%K A094862 nonn,more
%A A094862 Guenter Stertenbrink (Sterten(AT)aol.com), Jun 14 2004
%I A094863
%S A094863 1,2,3,4,7
%N A094863 Maximal number of longest common subsequences between any two strings of length n (Version 2).
%C A094863 Same as A094858 (which has much more information about the problem), except that we now we allow an arbitrary alphabet.
%O A094863 1,2
%K A094863 nonn,more,nice,hard
%A A094863 Guenter Stertenbrink (Sterten(AT)aol.com), Jun 14 2004
***************************************************************************************
>From russcox at gmail.com Sat Jun 12 12:46:49 2004
Date: Sat, 12 Jun 2004 12:46:17 -0400
From: Russ Cox <russcox at gmail.com>
To: njas at research.att.com
Subject: Re: [math-fun] number of longest common subsequences problem
This is for Version 1:
Is it obvious that the best answer is always
going to use a two-character alphabet?
It seems to be true, empirically. My search
program would be a lot faster if it only checked
the two-character alphabet.
I think the C program below computes the initial terms
of this sequence, for a two-character alphabet or an
arbitrary alphabet. For a two-character alphabet:
len 1: 1 lcs of length 1 for a a
len 2: 2 lcs of length 1 for aa ab
len 3: 4 lcs of length 2 for aab abb
len 4: 10 lcs of length 2 for abba baab
len 5: 24 lcs of length 2 for abbba baaab
len 6: 46 lcs of length 3 for aabbba abaaab
len 7: 100 lcs of length 4 for aaaaabb aabbbbb
len 8: 225 lcs of length 4 for aaaaaabb aabbbbbb
len 9: 525 lcs of length 5 for aaaaaaabb aaabbbbbb
len 10: 1225 lcs of length 6 for aaaaaaabbb aaabbbbbbb
len 11: 3136 lcs of length 6 for aaaaaaaabbb aaabbbbbbbb
len 12: 7056 lcs of length 7 for aaaaaaaaabbb aaaabbbbbbbb
The search finds the alphabetically earliest pair of strings
that exhibit the maximum. It's interesting that len 4, 5, 6 don't
follow the same string pattern as all the rest.
Over alphabets of arbitrary size, the results match the above
through len 6. Beyond that, the search takes longer than
I am patient.
The sequence 1,2,4,10,24,46 does not appear in the OEIS.
I'd be more comfortable submitting it if the terms were
corroborated by someone else.
Russ
#include <stdio.h>
#include <stdlib.h>
#define MAXN 20
#define debug 1
typedef struct Match Match;
struct Match
{
int len;
int count;
};
/*
* Compute number of lcs for a, b using dynamic programming.
* The lcs for the substrings a[0:i] b[0:j] is one of
*
* if a[i]==b[j], lcs(a[0:i-1], b[0:j-1])+a[i]
* lcs(a[0:i-1], b[0:j])
* lcs(a[0:i], b[0:j-1])
*
* We can compute the number of lcs by just adding up the
* number from each place, except that if we take them from
* lcs(a[0:i-1], b[0:j]) and lcs(a[0:i], b[0:j-1]) and those in turn
* didn't actually add any new ones, then we'll double-count
* the ones from lcs(a[0:i-1], b[0:j-1]), so we need to subtract
* those out.
*/
int
numlcs(char *a, char *b, int n, int *plen)
{
int i, j, d, t, len, count;
Match match[MAXN][MAXN];
for(d=0; d<2*n-1; d++){
i = 0;
j = d-i;
if(j >= n){
j = n-1;
i = d-j;
}
for(; i<n && j>=0; i++, j--){
/* start with nothing */
len = 0;
count = 0;
/* try adding to a previous match */
if(a[i] == b[j]){
if(i>0 && j>0){
len = match[i-1][j-1].len+1;
count = match[i-1][j-1].count;
}else{
len = 1;
count = 1;
}
}
/* try matches not containing a[i] */
if(i>0){
if((t=match[i-1][j].len) > len){
len = t;
count = match[i-1][j].count;
}else if(t == len)
count += match[i-1][j].count;
}
/* try matches not containing b[j] */
if(j>0){
if((t=match[i][j-1].len) > len){
len = t;
count = match[i][j-1].count;
}else if(t == len)
count += match[i][j-1].count;
}
/* we may have double-counted matches not containing either */
if(i>0 && j>0 && match[i-1][j-1].len==len)
count -= match[i-1][j-1].count;
match[i][j].len = len;
match[i][j].count = count;
}
}
if(plen)
*plen = match[n-1][n-1].len;
return match[n-1][n-1].count;
}
/*
* Enumerate all pairs of strings of length n over an alphabet of size nalpha
* and run numlcs on them. buf contains the concatenation of the two strings;
* only nbuf of the characters have been filled in so far. The nbuf characters
* only use the first nused characters in the alphabet, so we need only
* consider the first nused+1 alphabet characters as next possible.
*
* Accumulate the current max in bestcount, bestlen, besta, bestb.
*/
int bestcount;
int bestlen;
char besta[MAXN+1];
char bestb[MAXN+1];
char *alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
void
search(int nalpha, int nused, int n, char *buf, int nbuf)
{
int count, len;
char *p, *ep;
if(nbuf == 2*n){
count = numlcs(buf, buf+n, n, &len);
if(count > bestcount){
bestcount = count;
bestlen = len;
memmove(besta, buf, n);
memmove(bestb, buf+n, n);
}
}else{
ep = alpha+nalpha;
if(nused < nalpha)
ep = alpha+nused+1;
for(p=alpha; p<ep; p++){
buf[nbuf] = *p;
search(nalpha, (p<alpha+nused ? nused : nused+1),
n, buf, nbuf+1);
}
}
}
void
usage(void)
{
fprintf(stderr, "usage: lcs count aabaa aaabb\n");
fprintf(stderr, " or lcs search n\n");
fprintf(stderr, " or lcs search2 n\n");
exit(1);
}
int
main(int argc, char **argv)
{
int i, n, len;
char buf[2*MAXN];
if(argc < 2)
usage();
if(strcmp(argv[1], "count") == 0){
if(argc != 4)
usage();
n = strlen(argv[2]);
if(n != strlen(argv[3])){
fprintf(stderr, "string lengths must match\n");
exit(1);
}
n = numlcs(argv[2], argv[3], n, &len);
printf("%d of length %d\n", n, len);
}else if(strcmp(argv[1], "search") == 0){
if(argc != 3)
usage();
n = atoi(argv[2]);
search(2*n, 0, n, buf, 0);
printf("len %d: %d lcs of length %d for %s %s (all alphabets)\n",
n, bestcount, bestlen,besta, bestb);
}else if(strcmp(argv[1], "search2") == 0){
if(argc != 3)
usage();
n = atoi(argv[2]);
search(2, 0, n, buf, 0);
printf("len %d: %d lcs of length %d for %s %s (just alphabet=ab)\n",
n, bestcount, bestlen, besta, bestb);
}else
usage();
return 0;
}
Me: I added these two sequences:
%I A094837
%S A094837 1,2,4,10,24,46,100,225,525,1225,3136,7056
%N A094837 Maximal number of longest common subsequences between any two binary strings of length n (Version 1).
%C A094837 Definitions: S is a subsequence of X if S can be obtained by deleting some (not necessarily adjacent) entries of X.
%C A094837 S is a longest common subsequence of X and Y if S is a subsequence of X, S is a subsequence of Y, and for any T, if T is a subsequence of X and of Y, then |T| <= |S|. Let LCS(X,Y) = length of any longest common subsequence of X and Y.
%C A094837 For each common subsequence S of X and Y, there may be several ways to delete entries from X and from Y to get S: let F(X,Y) be the total number of ways. Sequence gives max F(X,Y) over all choices for binary strings X and Y of length n.
%C A094837 It appears that using a larger alphabet does not increase the answers: is this true?
%O A094837 1,2
%K A094837 nonn,nice,more
%Y A094837 A094838 gives length of S. See A094824 for a related problem involving substrings.
%A A094837 Russ Cox (russcox(AT)gmail.com), Jun 13 2004
%e A094837 Example illustrating a(4) = 10:
%e A094837 abba baab S
%e A094837 ------------
%e A094837 a..a .aa. aa
%e A094837 ab.. .a.b ab
%e A094837 ab.. ..ab ab
%e A094837 a.b. .a.b ab
%e A094837 a.b. ..ab ab
%e A094837 .bb. b..b bb
%e A094837 .b.a ba.. ba
%e A094837 .b.a b.a. ba
%e A094837 ..ba ba.. ba
%e A094837 ..ba b.a. ba
%e A094837 Details for lengths 1 through 12 showing lexicographically earliest examples for X and Y:
%e A094837 len 1: 1 lcs of length 1 for a a
%e A094837 len 2: 2 lcs of length 1 for aa ab
%e A094837 len 3: 4 lcs of length 2 for aab abb
%e A094837 len 4: 10 lcs of length 2 for abba baab
%e A094837 len 5: 24 lcs of length 2 for abbba baaab
%e A094837 len 6: 46 lcs of length 3 for aabbba abaaab
%e A094837 len 7: 100 lcs of length 4 for aaaaabb aabbbbb
%e A094837 len 8: 225 lcs of length 4 for aaaaaabb aabbbbbb
%e A094837 len 9: 525 lcs of length 5 for aaaaaaabb aaabbbbbb
%e A094837 len 10: 1225 lcs of length 6 for aaaaaaabbb aaabbbbbbb
%e A094837 len 11: 3136 lcs of length 6 for aaaaaaaabbb aaabbbbbbbb
%e A094837 len 12: 7056 lcs of length 7 for aaaaaaaaabbb aaaabbbbbbbb
%I A094838
%S A094838 1,1,2,2,2,3,4,4,5,6,6,7
%N A094838 Length of longest common subsequence corresponding to answers in A094837.
%O A094838 1,3
%K A094838 nonn,more
%A A094838 Russ Cox (russcox(AT)gmail.com), Jun 13 2004
***************************************************************************************
Date: Sat, 12 Jun 2004 13:00:19 -0400
To: math-fun <math-fun at mailman.xmission.com>
From: "Allan C. Wechsler" <acw at alum.mit.edu>
Subject: [math-fun] Most repeated longest subsequence
I'm having a little bit of trouble understanding the problem. It's
a matter of how the quantifiers are nested, and how the counting is
to be done. I would greatly appreciate it if Neil would go back to
Al Aho and get a really clear description of the problem.
My difficulties begin with the definition of f(X,Y). It is not clear
that we are permitted to count multiple occurrences of the same
subsequence; it seems equally probable to me that Aho wants to count
only subsequences with distinct content. I have other similar
hairsplitting problems, but let's start with that one. And ...
remember the Al Aho.
--ACW
Me: I hope the edited messages above have now answered these questions.
***************************************************************************************
From: "Thane Plambeck" <thane at best.com>
Subject: Re: [math-fun] subsequence, substrings, np completeness
1) Just a nit, but usually a string obtained by possibly non adjacent deletions
is called a "subsequence," not a "substring."
Me: Yes, sorry, I have tried to correct that in this edited file.
2) I'll contribute also the following NP-complete problem reference from Garey
and Johnson's book (pg 228 in my copy):
[SR10] Longest Common Subsequence
Instance: Finite alpha \Sigma, finite set R of strings from \Sigma^{*} and a positive
integer K.
Question: Is there a string w with |w| \geq K such that w is a subsequence of each x \in R?
Reference: Transformation from vertex cover [maier, 1978]. Remains NP complete even if
|\Sigma| = 2. Solvable in polynomial time for any fixed K or for fixed |R| (by dynamic programming).
The analogous longest common substring problem is triviallly solvable in polynomial time
Thane Plambeck
650 321 4884 office
650 323 4928 fax
http://www.plambeck.org
***************************************************************************************
Date: Sat, 12 Jun 2004 16:28:37 -0400 (EDT)
From: Edwin Clark <eclark at math.usf.edu>
Subject: Re: number of longest common subsequences problem
There seem to be many different ways to interpret this problem. In
particular internet seaches indicate there is a difference between common
substrings and common subsequences. From Neil's description it seems he is
referring to what some call common "subsequences".
Maple, for example, in its StringTools package has the two procedures
LongestCommonSubString(s1,s2) and LongestCommonSubSequence(s1,s2)
where s1 and s2 are strings.
These have the following descriptions (which seem to agree with other
definitions found by googling around.)
--------------------------------------------------------------------
A substring of a string S is a contiguous sequence of the characters
appearing in S. The empty string is a substring of every string. A
subsequence of a string S is a sequence of characters from S, which may
not be contiguous in S. Every substring of S is a subsequence of S. For
example, "bc" is a substring of "abc", and "ac" is a subsequence of "abc"
which is not a substring.
The LongestCommonSubString(s1, s2) command returns from its input strings,
s1 and s2, a common substring of maximum length.
Many common substrings of maximum length may exist. Which among the
candidates is returned depends upon the suffix structure of the pair of
strings, but is deterministic.
The LongestCommonSubSequence(s1, s2) command is similar, but searches for
subsequences of the pair of input strings rather than substrings.
--------------------------------------------------------------------
Once we know what a longest common subsequence or substring is, there is
the problem of how to count them: with multiplicities or not?
Anyhow, I did one of the simplest interpretations, presumably NOT one that
Aho intended, nevertheless perhaps new and of interest:
I'm counting (contiguous) substrings and not counting multiplicity:
What I did was to find the set CS(s1,s2) of common substrings (in the
above sense) for binary strings s1 and s2 of length n and count the number
A(s1,s2) of strings in CS(s1,s2) that are the longest. Then a(n) is the
max of A(s1,s2) over all such strings. If I haven't made an error
this gives the sequence a(n), n = 1,...,11:
1, 2, 2, 2, 3, 3, 4, 6, 6, 7, 7
For example the fact that a(7) = 4 is given by the two strings
s = 0001011,t = 0011010 for which the set of maximum length
common substrings is {011,001,101,010} and the fact that no other pair of
strings of length 7 has more than 4 common maximum length substrings.
Apparently this sequence is not is the OEIS. But it seems simple enough
that maybe someone can find a formula for it. And maybe someone using
something faster than Maple can compute more terms even if no one can
figure out a formula.
--Edwin
Here is the new sequence:
%I A094824
%S A094824 1,2,2,2,3,3,4,6,6,7,7,8
%N A094824 Maximum number of longest common substrings of two binary sequences of length n.
%C A094824 A substring of a string is a subsequence of contiguous symbols in the string. For example, 00 is a substring of 001 but not of 010. For this sequence we do not count the multiplicity of occurrence of common substrings.
%e A094824 a(7) = 4 since the two strings 0001011 and 0011010 have as maximum length common substrings the 4 strings 011,001,101,010, and computer search shows that no other pair of strings of length 7 has more than 4 common maximum length substrings.
%Y A094824 See A094837 for a related sequence.
%K A094824 hard,more,nonn,nice,new
%O A094824 1,2
%A A094824 W. Edwin Clark (eclark(AT)math.usf.edu), Jun 12 2004
%Y A094837 A094838 gives length of S. See A094824 for a related problem involving substrings.
%Y A094858 Cf. A094859-A094862, A094837, A094824.
***************************************************************************************
From njas at research.att.com Sat Jun 12 17:17:45 2004
To: seqfan at ext.jussieu.fr, math-fun at mailman.xmission.com
Subject: Re: longest common subsequence - clarification of problem
Al Aho tells me that the problem he is interested in
is Version 2: find the maximum number of longest common subsequences
of two sequences X and Y of length n, without regard to
where the subsequence occurs in X and Y.
So for X = aabaa, Y = aaabb, the length of the lcs is 3,
and there are two subsequences of length 3 that are
possible, aaa and aab. So here F(X,Y) = 2.
Some definitions:
S is a subsequence of X if S can be
obtained by deleting some of the entries of X
(not necessarily adjacent).
S is a longest common subsequence of X and Y if S is a subsequence of X,
S is a subsequence of Y, and for any T, if T is a
subsequence of X and of Y, then |T| <= |S|.
(Thanks to Michael Kleber for a correction here)
Given sequences X and Y of length n, let LCS(X,Y) be the
length of the longest common subsequence.
Let F(X,Y) = no of sequences S such that S is a
longest common subsequence of X and Y. So S has length LCS(X,Y).
Given n, what is the maximal value of F(X,Y)
over all choices of X and Y over any alphabet?
That's Version 2.
The version of the problem that i stated earlier, Version 1,
where you also take into account the location of S inside X and Y,
is also interesting, it is just not Al's original question.
NJAS
***************************************************************************************
From wouter.meeussen at pandora.be Sat Jun 12 18:25:25 2004
Subject: Re: [math-fun] Re: longest common subsequence - clarification of problem
Date: Sun, 13 Jun 2004 00:25:01 +0200
my earlier program stopped too soon, better:
LongestCommonSubstring[keep : {{__} ..}] :=
Module[{done = False, uit = {}, it = keep, alfa = {1}, z},
While[Intersection @@ it =!= {} && ! done,
z = 1;
While[ ! (done = ( {0} === (alfa ))) && (alfa =
Intersection @@ Take[it, All, {1, z}]) === {},
z++];
If[done, Break[]];(* Print[{z, alfa[[1]]}] *);
AppendTo[uit,
temp = Flatten[
Thread[w[Take[it, All, {1, z}], alfa[[1]], 1, 1]] /.
w -> Position]];
it = Thread[rop[it, temp]] /. rop -> Drop ;
it = PadRight[#, Max[Length /@ keep]] & /@ it;(* Print[it] *)];
Drop[Rest at FoldList[Plus, 0, uit], -1]]
other lines as before:
keep = Table[Random[Integer, {1, 4}], {8}, {36}]
soln = LongestCommonSubstring[keep]
MapAt[w, keep, Flatten[Transpose[{Range[Length[#]], #}] & /@ soln, 1]] /.
w[i_] :> (j = (i - 1)/4.;
StyleForm[i, FontColor -> Hue[j, 1, 1], FontWeight -> "Bold"] )
Length @ soln
First[keep][[First /@ soln]]
W.
***************************************************************************************
From rschroe at sandia.gov Sun Jun 13 22:01:17 2004
From: "Schroeppel, Richard" <rschroe at sandia.gov>
To: "'Russ Cox '" <russcox at gmail.com>,
"'njas at research.att.com '" <njas at research.att.com>
Cc: "Schroeppel, Richard" <rschroe at sandia.gov>,
"'math-fun '" <math-fun at mailman.xmission.com>,
"'seqfan at ext.jussieu.fr '" <seqfan at ext.jussieu.fr>
Subject: RE: [math-fun] number of longest common substrings problem
Curious that so many of Russ's counts are squares:
1,-,4,-,-,-,100,225,-,1225,3136,7056
Rich
***************************************************************************************
From hv at crypt.org Mon Jun 14 07:46:09 2004
Cc: seqfan at ext.jussieu.fr, math-fun at mailman.xmission.com
Subject: Re: number of longest common substrings problem
"N. J. A. Sloane" <njas at research.att.com> wrote:
:Hugo asks:
:
:> Here you introduce the word "different" for the first time. Are two
:> substrings "different" if the substrings themselves differ, or is it
:> sufficient for the letters of the substring to be drawn from different
:> positions in X or Y? Ie given:
:> X = "aabaa", Y = "aaabb"
:> are there 1 or 4 occurrences of the LCS "aaa"?
:
:Me: The latter, I think, since the goal of the question is to
:determine how rapidly these numbers can grow.
In that case, here are some initial thoughts:
- a(n) has a lower bound of A001405 (C(n, floor(n/2))), since we can
always construct an LH string of all 'a's, and a RH string of
floor(n/2) a's followed by ceil(n/2) b's.
- a couple of hand-crafted examples of a(n) exceeding that lower bound:
(aab, abb) => a(3) >= 4 > C(3,1)
(abbbc, aabcc) => a(5) >= 12 > C(5,2)
(aaabb, abbbb) => a(5) >= 18
(abbbbc, aabccc) => a(6) >= 24 > C(6,3)
(abbbbc, aabbcc) => a(6) >= 24
(aaaaab, aabbbb) => a(6) >= 40
I think this serves to demonstrate the style of string pairs that will
maximise the count; I'm not yet sure whether it is ever beneficial to
have non-monotonic strings such as "abababab"; if not, this essentially
boils down to finding <x> and <y>, two ordered partitions of n such that
prod(C(max(x_i,y_i), min(x_i,y_i))) is maximised; that product would then
be a(n). If I've calculated correctly, that gives a sequence starting:
1 2 4 9 18 40 100 225 525 ...
.. but I don't see that currently in EIS: have I miscalculated? Those
figures can be generated by the following strings:
n a(n)? strings
1 1 a a
2 2 aa ab
3 4 aab abb
4 9 aaab abbb
5 18 aaaab aabbb
6 40 aaaaab aabbbb
7 100 aaaaabb aabbbbb
8 225 aaaaaabb aabbbbbb
9 525 aaaaaaabb aaabbbbbb
By eye, it appears that the maximum product is always generated by
two-element partitions, in which case it further simplifies to selecting
0 <= x <= y <= n such that C(y, x) * C(n-x, n-y) is maximised, which lets
me quickly extend the above sequence:
1 2 4 9 18 40 100 225 525 1225 3136 7056 17640 44100 108900 261360 637065
1656369 4008004 10020010 25050025 64128064 155739584 393853824 1012766976
2538950544 6347376360 15868440900 41408180100 102252852900 261312846300
667799496100 1709566710016 4273916775040 10851741811625 28214528710225
71170904601225 181162302621300 461140406672400
This sequence exhibits some fascinating patterns: if we look at the
factorisation of these as a vector of powers of p_n, we get this:
1: (0)
2: 1 (1)
3: 2 (2)
4: 0 2 (2)
5: 1 2 (3)
6: 3 0 1 (4) # ie f(6) = 40 = 2^3 * 3^0 * 5^1 (3 + 0 + 1 = 4)
7: 2 0 2 (4)
8: 0 2 2 (4)
9: 0 1 2 1 (4)
10: 0 0 2 2 (4)
11: 6 0 0 2 (8)
12: 4 2 0 2 (8)
13: 3 2 1 2 (8)
14: 2 2 2 2 (8)
15: 2 2 2 0 2 (8)
16: 4 3 1 0 2 (10)
17: 0 4 1 0 2 1 (8)
18: 0 4 0 0 2 2 (8)
19: 2 0 0 2 2 2 (8)
20: 1 0 1 2 2 2 (8)
21: 0 0 2 2 2 2 (8)
22: 6 0 0 2 2 2 (12)
23: 6 0 0 1 2 2 1 (12)
24: 7 2 0 1 0 2 2 (14)
25: 8 4 0 0 0 2 2 (16)
26: 4 2 0 0 0 2 2 2 (12)
27: 3 2 1 0 0 2 2 2 (12)
28: 2 2 2 0 0 2 2 2 (12)
29: 2 4 2 2 0 0 2 2 (14)
30: 2 4 2 0 2 0 2 2 (14)
31: 2 2 2 0 2 0 2 2 1 (13)
32: 2 0 2 0 2 0 2 2 2 (12)
33: 8 0 0 0 2 0 2 2 2 (16)
34: 7 0 1 0 2 0 2 2 2 (16)
35: 0 0 3 0 2 1 2 2 2 (12)
36: 0 0 2 0 2 2 2 2 2 (12)
37: 0 6 2 0 2 2 0 2 2 (16)
38: 2 6 2 1 1 2 0 2 2 (18)
39: 4 6 2 2 0 2 0 2 2 (20)
.. and this could generate further interesting new sequences (eg f(n)
is odd; f(n) is not square (do the first differences of this form a
cyclce of period 7?); f(n) has an odd number of prime factors).
In case others find it useful, I've put up a perl program to calculate
the LCS count for two strings at <http://www.crypt.org/hv/puzzle/lcsc>.
(It requires the additional perl module Algorithm::Loops, found at
<http://search.cpan.org/~tyemq/Algorithm-Loops>.)
This program is fast for strings up to a certain complexity, but beyond
that consumes large amounts of memory as the number of different
subsequences undergoes a combinatorial explosion.
Hugo
ME: I DID NOT ADD THE ABOVE SEQUENCE YET, because I don't
understand why it differs from Russ Cox's sequence A094837 (see
earlier in this message), which has a(4) = 10 not 9.
Is Hugo's sequence wrong, or is it the answer to a different question?
NJAS
***************************************************************************************
(end of message)
More information about the SeqFan
mailing list