Skip to content

[Rule] LongestCommonSubsequence to MaximumIndependentSet #109

@zazabap

Description

@zazabap

Source: LongestCommonSubsequence
Target: MaximumIndependentSet
Motivation: Connects string problems to graph theory; enables solving LCS via MaxIS solvers and reveals structural equivalence between subsequence ordering and graph independence.
Reference: Apostolico & Guerra, 1987; Santini, Blum, Djukanovic et al., 2021

Reduction Algorithm

Notation:

  • Source: $k$ strings $s_1, \ldots, s_k$ over alphabet $\Sigma$, with lengths $n_1, \ldots, n_k$
  • Target: undirected graph $G = (V, E)$

Step 1 — Construct match nodes $V$:

A node $v = (p_1, p_2, \ldots, p_k)$ is created for each $k$-tuple of positions such that $s_1[p_1] = s_2[p_2] = \cdots = s_k[p_k]$ (all strings have the same character at their respective positions).

$$V = {(p_1, \ldots, p_k) : s_1[p_1] = s_2[p_2] = \cdots = s_k[p_k]}$$

Step 2 — Construct conflict edges $E$:

Two nodes $u = (a_1, \ldots, a_k)$ and $v = (b_1, \ldots, b_k)$ are connected by an edge if they conflict — i.e., they cannot both appear in a valid common subsequence. A conflict occurs when the position differences are not consistently ordered across all strings:

$${u, v} \in E \iff \text{NOT}\ (\forall i: a_i < b_i) \ \text{AND NOT}\ (\forall i: a_i > b_i)$$

In other words, there exist indices $i, j$ such that $a_i &lt; b_i$ but $a_j &gt; b_j$ (a "crossing"), or $a_i = b_i$ for some $i$ (same position used twice).

Variable mapping: Each LCS character corresponds to a selected node; each MaxIS vertex in the independent set corresponds to a matched position tuple.

Objective: $\text{MaxIS}(G) = \text{LCS}(s_1, \ldots, s_k)$. An independent set of size $L$ gives a common subsequence of length $L$, and vice versa.

Required Model Extension

The implementer must add the following getter to LongestCommonSubsequence in src/models/misc/longest_common_subsequence.rs in the same PR as the reduction rule:

/// Returns the cross-frequency product: the sum over each alphabet symbol
/// of the product of that symbol's frequency across all input strings.
///
/// Formally: Σ_{c ∈ 0..alphabet_size} ∏_{i=1..k} count(c, strings[i])
/// where count(c, s) is the number of occurrences of symbol c in string s.
///
/// This equals the exact number of match-node vertices in the LCS → MaxIS
/// reduction graph.
pub fn cross_frequency_product(&self) -> usize {
    (0..self.alphabet_size)
        .map(|c| {
            self.strings
                .iter()
                .map(|s| s.iter().filter(|&&sym| sym == c).count())
                .product::<usize>()
        })
        .sum()
}

This getter is referenced in the #[reduction(overhead)] attribute as cross_frequency_product.

Size Overhead

Target metric (code name) Expression
num_vertices cross_frequency_product
num_edges cross_frequency_product^2

Where cross_frequency_product = $\sum_{c \in \Sigma} \prod_{i=1}^{k} \text{count}(c, s_i)$. This equals the exact number of match nodes $|V|$. The edge count is a worst-case bound (all node pairs conflict).

Validation Method

Closed-loop testing: solve LCS by brute-force (enumerate subsequences of shortest string), solve the reduced MaxIS instance, and verify that both give the same optimal value. Test on instances with varying alphabet sizes and string counts ($k = 2, 3, 4$).

Example

Source instance: $k = 2$, $s_1$ = ABAC, $s_2$ = BACA, alphabet $\Sigma = {A, B, C}$.

Step 1 — Match nodes:

Node $s_1$ pos $s_2$ pos Character
$v_0$ 0 1 A
$v_1$ 0 3 A
$v_2$ 1 0 B
$v_3$ 2 1 A
$v_4$ 2 3 A
$v_5$ 3 2 C

Step 2 — Conflict edges (9 edges):

Edge Reason
$v_0 - v_1$ same $s_1$ position (0)
$v_0 - v_2$ crossing: $s_1$: $0&lt;1$, $s_2$: $1&gt;0$
$v_0 - v_3$ same $s_2$ position (1)
$v_1 - v_2$ crossing: $s_1$: $0&lt;1$, $s_2$: $3&gt;0$
$v_1 - v_3$ crossing: $s_1$: $0&lt;2$, $s_2$: $3&gt;1$
$v_1 - v_4$ same $s_2$ position (3)
$v_1 - v_5$ crossing: $s_1$: $0&lt;3$, $s_2$: $3&gt;2$
$v_3 - v_4$ same $s_1$ position (2)
$v_4 - v_5$ crossing: $s_1$: $2&lt;3$, $s_2$: $3&gt;2$

MaxIS solution: ${v_2, v_3, v_5}$ (size 3, unique).

Node Position Char
$v_2$ $s_1[1], s_2[0]$ B
$v_3$ $s_1[2], s_2[1]$ A
$v_5$ $s_1[3], s_2[2]$ C

Extracted LCS: BAC (length 3). Verified: B-A-C is a subsequence of both ABAC (positions 1,2,3) and BACA (positions 0,1,2).

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.ruleA new reduction rule to be added.

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions