Motivation
Foundational string problem underlying diff/version control, bioinformatics (DNA/protein alignment), and data compression. Poly-time for 2 strings via DP, but NP-hard for $k$ strings. Has known reductions to ILP and SAT. Garey & Johnson SR10. This problem is also the core of the git diff.
Definition
Name: LongestCommonSubsequence
Reference: Garey & Johnson, 1979 (SR10); Maier, 1978
Given a finite alphabet $\Sigma$ and a set of $k$ strings $R = {s_1, \ldots, s_k}$ over $\Sigma$, find a longest string $w$ that is a subsequence of every $s_i \in R$.
A string $w$ is a subsequence of $s$ if $w$ can be obtained by deleting zero or more characters from $s$ without changing the order of the remaining characters.
Variables
-
Count: $m$ (length of the shortest string in $R$; upper bound on solution length)
-
Per-variable domain: $\Sigma$ (alphabet symbols)
-
Meaning: $w_j$ is the $j$-th character of the common subsequence $w$
Schema (data type)
Type name: LongestCommonSubsequence
| Field |
Type |
Description |
| strings |
Vec<Vec<u8>> |
The input strings $R = {s_1, \ldots, s_k}$
|
Problem Size
| Metric |
Expression |
Description |
| num_strings |
$k$ |
Number of input strings |
| total_length |
$\sum_i \lvert s_i \rvert$ |
Sum of all string lengths |
Complexity
-
Two strings: $O(mn)$ via dynamic programming (Wagner & Fischer, 1974)
-
$k$ strings: NP-hard (Maier, 1978)
-
Approximation: Not approximable within $n^{1/4-\varepsilon}$ for any $\varepsilon > 0$; approximable within $O(m / \log m)$ where $m = \min_i |s_i|$
How to solve
Bruteforce: enumerate all $2^m$ subsequences of the shortest string $s_1$, check each against all other strings, return the longest common one.
Example Instance
$k = 3$ strings over alphabet $\Sigma = {A, B, C, D}$:
| String |
Value |
| $s_1$ |
ABCDAB |
| $s_2$ |
BDCABA |
| $s_3$ |
BCADBA |
Optimal solution: $w$ = BCAB, length 4.
Verification ($w$ is a subsequence of each $s_i$):
| String |
Matched positions |
Trace |
ABCDAB |
1, 2, 4, 5 |
ABCDAB |
BDCABA |
0, 2, 3, 4 |
BDCABA |
BCADBA |
0, 1, 2, 4 |
BCADBA |
Note: pairwise LCS lengths are 4, 4, and 5 respectively — the 3-string LCS is harder than any pairwise LCS. For $k = 2$ this problem is solvable in $O(mn)$ time, but the $k$-string generalization is NP-hard.
Motivation
Foundational string problem underlying diff/version control, bioinformatics (DNA/protein alignment), and data compression. Poly-time for 2 strings via DP, but NP-hard for$k$ strings. Has known reductions to ILP and SAT. Garey & Johnson SR10. This problem is also the core of the git diff.
Definition
Name: LongestCommonSubsequence
Reference: Garey & Johnson, 1979 (SR10); Maier, 1978
Given a finite alphabet$\Sigma$ and a set of $k$ strings $R = {s_1, \ldots, s_k}$ over $\Sigma$ , find a longest string $w$ that is a subsequence of every $s_i \in R$ .
A string$w$ is a subsequence of $s$ if $w$ can be obtained by deleting zero or more characters from $s$ without changing the order of the remaining characters.
Variables
Schema (data type)
Type name: LongestCommonSubsequence
Vec<Vec<u8>>Problem Size
Complexity
How to solve
Bruteforce: enumerate all$2^m$ subsequences of the shortest string $s_1$ , check each against all other strings, return the longest common one.
Example Instance
ABCDABBDCABABCADBAOptimal solution:$w$ =
BCAB, length 4.Verification ($w$ is a subsequence of each $s_i$ ):
ABCDABBDCABABCADBANote: pairwise LCS lengths are 4, 4, and 5 respectively — the 3-string LCS is harder than any pairwise LCS. For$k = 2$ this problem is solvable in $O(mn)$ time, but the $k$ -string generalization is NP-hard.