[Model] LongestCommonSubsequence

## Motivation

Foundational string problem underlying diff/version control, bioinformatics (DNA/protein alignment), and data compression. Poly-time for 2 strings via DP, but NP-hard for $k$ strings. Has known reductions to ILP and SAT. Garey & Johnson SR10. This problem is also the core of the git diff.

## Definition

**Name:** LongestCommonSubsequence
**Reference:** [Garey & Johnson, 1979 (SR10)](https://en.wikipedia.org/wiki/Longest_common_subsequence); [Maier, 1978](https://doi.org/10.1145/322063.322075)

Given a finite alphabet $\Sigma$ and a set of $k$ strings $R = \{s_1, \ldots, s_k\}$ over $\Sigma$, find a longest string $w$ that is a subsequence of every $s_i \in R$.

A string $w$ is a **subsequence** of $s$ if $w$ can be obtained by deleting zero or more characters from $s$ without changing the order of the remaining characters.

## Variables

- **Count:** $m$ (length of the shortest string in $R$; upper bound on solution length)
- **Per-variable domain:** $\Sigma$ (alphabet symbols)
- **Meaning:** $w_j$ is the $j$-th character of the common subsequence $w$

## Schema (data type)

**Type name:** LongestCommonSubsequence

| Field | Type | Description |
|-------|------|-------------|
| strings | `Vec<Vec<u8>>` | The input strings $R = \{s_1, \ldots, s_k\}$ |

## Problem Size

| Metric | Expression | Description |
|--------|------------|-------------|
| num_strings | $k$ | Number of input strings |
| total_length | $\sum_i \lvert s_i \rvert$ | Sum of all string lengths |

## Complexity

- **Two strings:** $O(mn)$ via dynamic programming (Wagner & Fischer, 1974)
- **$k$ strings:** NP-hard (Maier, 1978)
- **Approximation:** Not approximable within $n^{1/4-\varepsilon}$ for any $\varepsilon > 0$; approximable within $O(m / \log m)$ where $m = \min_i |s_i|$

## How to solve

- [x] It can be solved by (existing) bruteforce.
- [ ] It can be solved by reducing to integer programming, through a new LCS → ILP rule issue (to be filed).

Bruteforce: enumerate all $2^m$ subsequences of the shortest string $s_1$, check each against all other strings, return the longest common one.

## Example Instance

$k = 3$ strings over alphabet $\Sigma = \{A, B, C, D\}$:

| String | Value |
|:---:|:---:|
| $s_1$ | `ABCDAB` |
| $s_2$ | `BDCABA` |
| $s_3$ | `BCADBA` |

Optimal solution: $w$ = `BCAB`, length 4.

Verification ($w$ is a subsequence of each $s_i$):

| String | Matched positions | Trace |
|:---:|:---:|:---:|
| `ABCDAB` | 1, 2, 4, 5 | A**BC**D**AB** |
| `BDCABA` | 0, 2, 3, 4 | **B**D**CAB**A |
| `BCADBA` | 0, 1, 2, 4 | **BCA**D**B**A |

Note: pairwise LCS lengths are 4, 4, and 5 respectively — the 3-string LCS is harder than any pairwise LCS. For $k = 2$ this problem is solvable in $O(mn)$ time, but the $k$-string generalization is NP-hard.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Model] LongestCommonSubsequence #108

Motivation

Definition

Variables

Schema (data type)

Problem Size

Complexity

How to solve

Example Instance

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Metric	Expression	Description
num_strings	$k$	Number of input strings
total_length	$\sum_i \lvert s_i \rvert$	Sum of all string lengths

String	Matched positions	Trace
`ABCDAB`	1, 2, 4, 5	ABCDAB
`BDCABA`	0, 2, 3, 4	BDCABA
`BCADBA`	0, 1, 2, 4	BCADBA

[Model] LongestCommonSubsequence #108

Description

Motivation

Definition

Variables

Schema (data type)

Problem Size

Complexity

How to solve

Example Instance

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions