number of longest common substrings problem

hv at crypt.org hv at crypt.org
Mon Jun 14 13:47:50 CEST 2004


"N. J. A. Sloane" <njas at research.att.com> wrote:
:Hugo asks:
:
:> Here you introduce the word "different" for the first time. Are two
:> substrings "different" if the substrings themselves differ, or is it
:> sufficient for the letters of the substring to be drawn from different
:> positions in X or Y? Ie given:
:>    X = "aabaa", Y = "aaabb"
:> are there 1 or 4 occurrences of the LCS "aaa"?
:
:Me:  The latter, I think, since the goal of the question is to
:determine how rapidly these numbers can grow.

In that case, here are some initial thoughts:

- a(n) has a lower bound of A001405 (C(n, floor(n/2))), since we can
  always construct an LH string of all 'a's, and a RH string of
  floor(n/2) a's followed by ceil(n/2) b's.

- a couple of hand-crafted examples of a(n) exceeding that lower bound:
     (aab, abb)    => a(3) >=  4 > C(3,1)
   (abbbc, aabcc)  => a(5) >= 12 > C(5,2)
   (aaabb, abbbb)  => a(5) >= 18
  (abbbbc, aabccc) => a(6) >= 24 > C(6,3)
  (abbbbc, aabbcc) => a(6) >= 24
  (aaaaab, aabbbb) => a(6) >= 40

I think this serves to demonstrate the style of string pairs that will
maximise the count; I'm not yet sure whether it is ever beneficial to
have non-monotonic strings such as "abababab"; if not, this essentially
boils down to finding <x> and <y>, two ordered partitions of n such that
prod(C(max(x_i,y_i), min(x_i,y_i))) is maximised; that product would then
be a(n). If I've calculated correctly, that gives a sequence starting:
  1 2 4 9 18 40 100 225 525 ...
.. but I don't see that currently in EIS: have I miscalculated? Those
figures can be generated by the following strings:
  n  a(n)?  strings
  1      1  a a
  2      2  aa ab
  3      4  aab abb
  4      9  aaab abbb
  5     18  aaaab aabbb
  6     40  aaaaab aabbbb
  7    100  aaaaabb aabbbbb
  8    225  aaaaaabb aabbbbbb
  9    525  aaaaaaabb aaabbbbbb

By eye, it appears that the maximum product is always generated by
two-element partitions, in which case it further simplifies to selecting
0 <= x <= y <= n such that C(y, x) * C(n-x, n-y) is maximised, which lets
me quickly extend the above sequence:
  1 2 4 9 18 40 100 225 525 1225 3136 7056 17640 44100 108900 261360 637065
  1656369 4008004 10020010 25050025 64128064 155739584 393853824 1012766976
  2538950544 6347376360 15868440900 41408180100 102252852900 261312846300
  667799496100 1709566710016 4273916775040 10851741811625 28214528710225
  71170904601225 181162302621300 461140406672400

This sequence exhibits some fascinating patterns: if we look at the
factorisation of these as a vector of powers of p_n, we get this:
 1: (0)
 2: 1 (1)
 3: 2 (2)
 4: 0 2 (2)
 5: 1 2 (3)
 6: 3 0 1 (4)   # ie f(6) = 40 = 2^3 * 3^0 * 5^1 (3 + 0 + 1 = 4)
 7: 2 0 2 (4)
 8: 0 2 2 (4)
 9: 0 1 2 1 (4)
10: 0 0 2 2 (4)
11: 6 0 0 2 (8)
12: 4 2 0 2 (8)
13: 3 2 1 2 (8)
14: 2 2 2 2 (8)
15: 2 2 2 0 2 (8)
16: 4 3 1 0 2 (10)
17: 0 4 1 0 2 1 (8)
18: 0 4 0 0 2 2 (8)
19: 2 0 0 2 2 2 (8)
20: 1 0 1 2 2 2 (8)
21: 0 0 2 2 2 2 (8)
22: 6 0 0 2 2 2 (12)
23: 6 0 0 1 2 2 1 (12)
24: 7 2 0 1 0 2 2 (14)
25: 8 4 0 0 0 2 2 (16)
26: 4 2 0 0 0 2 2 2 (12)
27: 3 2 1 0 0 2 2 2 (12)
28: 2 2 2 0 0 2 2 2 (12)
29: 2 4 2 2 0 0 2 2 (14)
30: 2 4 2 0 2 0 2 2 (14)
31: 2 2 2 0 2 0 2 2 1 (13)
32: 2 0 2 0 2 0 2 2 2 (12)
33: 8 0 0 0 2 0 2 2 2 (16)
34: 7 0 1 0 2 0 2 2 2 (16)
35: 0 0 3 0 2 1 2 2 2 (12)
36: 0 0 2 0 2 2 2 2 2 (12)
37: 0 6 2 0 2 2 0 2 2 (16)
38: 2 6 2 1 1 2 0 2 2 (18)
39: 4 6 2 2 0 2 0 2 2 (20)

.. and this could generate further interesting new sequences (eg f(n)
is odd; f(n) is not square (do the first differences of this form a
cyclce of period 7?); f(n) has an odd number of prime factors).

In case others find it useful, I've put up a perl program to calculate
the LCS count for two strings at <http://www.crypt.org/hv/puzzle/lcsc>.
(It requires the additional perl module Algorithm::Loops, found at
<http://search.cpan.org/~tyemq/Algorithm-Loops>.)

This program is fast for strings up to a certain complexity, but beyond
that consumes large amounts of memory as the number of different
subsequences undergoes a combinatorial explosion.

Hugo





More information about the SeqFan mailing list