Motivation
SHORTEST COMMON SUPERSEQUENCE from Garey & Johnson, A4 SR8. A fundamental NP-complete problem in string algorithms. Given a set of strings, find the shortest string that contains each input string as a subsequence (not necessarily contiguous). Different from Shortest Common Superstring (P157), which requires contiguous containment (substring). For two strings, SCS is solvable in polynomial time via the dual relationship with LCS (longest common subsequence). For an arbitrary number of strings, the problem becomes NP-complete even for |Sigma| = 5 (Maier, 1978). Applications include data compression, sequence alignment, and scheduling.
Associated rules:
- R102: MinimumVertexCover -> ShortestCommonSupersequence (as target)
Prerequisite note: The source rule R102 requires MinimumVertexCover which already exists in the codebase.
Definition
Name: ShortestCommonSupersequence
Canonical name: SHORTEST COMMON SUPERSEQUENCE
Reference: Garey & Johnson, Computers and Intractability, A4 SR8
Mathematical definition:
INSTANCE: Finite alphabet Sigma, finite set R of strings from Sigma*, and a positive integer K.
QUESTION: Is there a string w in Sigma* with |w| <= K such that each string x in R is a subsequence of w, i.e., x can be obtained from w by deleting zero or more characters (not necessarily contiguous)?
Variables
- Count: K variables (one per position in the candidate supersequence w), where K is the length bound.
- Per-variable domain: {0, 1, ..., |Sigma|-1} -- index into the alphabet Sigma.
- Meaning: Variable i encodes the symbol at position i of the candidate supersequence w. A satisfying assignment produces a string w of length <= K such that every string in R is a subsequence of w. Alternatively, the problem can be viewed as choosing an interleaving order for the symbols of all input strings.
dims() specification: vec![alphabet_size; bound] where alphabet_size = |Sigma| and bound = K. Each of the K positions can take any alphabet symbol.
Schema (data type)
Type name: ShortestCommonSupersequence
Variants: none
| Field |
Type |
Description |
alphabet_size |
usize |
Size of the alphabet |
strings |
Vec<Vec<usize>> |
Set R of input strings, each encoded as a vector of alphabet indices |
bound |
usize |
Length bound K |
Size fields (getter methods for overhead expressions and declare_variants!):
alphabet_size() — returns |Sigma|
num_strings() — returns |R|
bound() — returns K
total_length() — returns sum of |x| for x in R
Notes:
- This is a satisfaction (decision) problem:
Metric = bool, implementing SatisfactionProblem.
- A string x is a subsequence of w if x can be obtained by deleting characters from w (characters need not be contiguous in w).
- Different from ShortestCommonSuperstring where containment is as a contiguous substring.
- For the optimization version, minimize |w| subject to the subsequence constraint.
Complexity
- Best known exact algorithm:
- For |R| = 2: O(|x_1| * |x_2|) by DP via the dual LCS relationship: SCS(x_1, x_2) = |x_1| + |x_2| - LCS(x_1, x_2).
- For fixed |R|: O(product of |x_i| for i=1..|R|) by multi-dimensional DP.
- For arbitrary |R|: NP-hard. Exact algorithms are exponential. A* search with lower bounds has been applied. Brute force is O(|Sigma|^K) to enumerate all candidate strings.
- Approximation: The greedy algorithm (repeatedly pick the symbol that satisfies the most constraints) gives an O(|Sigma|)-approximation. No PTAS is known.
- NP-completeness: NP-complete (Maier, 1978). Remains NP-complete even if |Sigma| = 5. Also NP-complete over binary alphabet (Raiha and Ukkonen, 1981).
- Polynomial cases: Solvable in polynomial time if |R| = 2 or if all strings have length <= 2.
declare_variants! complexity string: "alphabet_size ^ bound" (brute-force enumeration of all candidate strings).
- References:
- David Maier (1978). "The complexity of some problems on subsequences and supersequences". JACM 25(2):322-336.
- K. J. Raiha and E. Ukkonen (1981). "The shortest common supersequence problem over binary alphabet is NP-complete". Theoretical Computer Science 16(2):187-198.
Extra Remark
Full book text:
INSTANCE: Finite alphabet Sigma, finite set R of strings from Sigma*, and a positive integer K.
QUESTION: Is there a string w in Sigma* with |w| <= K such that each string x in R is a subsequence of w, i.e., w = w0x1w1x2w2 ... xkwk where each wi in Sigma* and x = x1x2 ... xk?
Reference: [Maier, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if |Sigma| = 5. Solvable in polynomial time if |R| = 2 (by first computing the largest common subsequence) or if all x in R have |x| <= 2.
How to solve
Example Instance
Instance 1 (YES, three strings over ternary alphabet):
- Alphabet: Sigma = {a, b, c} (|Sigma| = 3)
- Strings: R = {"abcb", "bcab", "acba"}
- Bound K = 7
- Candidate supersequence: w = "abcacba" (length 7)
- "abcb" subsequence at positions 1,2,3,6
- "bcab" subsequence at positions 2,3,4,6
- "acba" subsequence at positions 1,3,6,7
- Answer: YES
- Exhaustive enumeration: search space 3⁷ = 2,187. Exactly 42 feasible supersequences exist, including "abcacba". (Used in
find_all_satisfying test)
Instance 2 (YES, six strings over binary alphabet):
- Alphabet: Sigma = {0, 1} (|Sigma| = 2)
- Strings: R = {"0110", "1010", "0101", "1100", "0011", "1001"}
- Bound K = 8
- Candidate supersequence: w = "01101001" (length 8)
- "0110" at positions 1,2,3,4
- "1010" at positions 2,4,5,6
- "0101" at positions 1,2,4,5
- "1100" at positions 2,3,4,6
- "0011" at positions 1,4,5,8
- "1001" at positions 2,4,6,8
- Answer: YES
Instance 3 (NO):
- Alphabet: Sigma = {a, b, c}
- Strings: R = {"abc", "bca", "cab", "acb", "bac", "cba"} (all 6 permutations of {a,b,c})
- Bound K = 5
- The minimum SCS of all permutations of 3 symbols has length 7. This follows from the known result that the shortest common supersequence of all n! permutations of n symbols has length n² − 2n + 4 for n ≥ 2 (for n = 3: 9 − 6 + 4 = 7). One such supersequence is "abcabca". Since 7 > 5, no supersequence of length <= 5 exists.
- Answer: NO
- Exhaustive enumeration: search space 3⁵ = 243. 0 feasible supersequences. (Used in
find_all_satisfying empty test)
Motivation
SHORTEST COMMON SUPERSEQUENCE from Garey & Johnson, A4 SR8. A fundamental NP-complete problem in string algorithms. Given a set of strings, find the shortest string that contains each input string as a subsequence (not necessarily contiguous). Different from Shortest Common Superstring (P157), which requires contiguous containment (substring). For two strings, SCS is solvable in polynomial time via the dual relationship with LCS (longest common subsequence). For an arbitrary number of strings, the problem becomes NP-complete even for |Sigma| = 5 (Maier, 1978). Applications include data compression, sequence alignment, and scheduling.
Associated rules:
Prerequisite note: The source rule R102 requires
MinimumVertexCoverwhich already exists in the codebase.Definition
Name:
ShortestCommonSupersequenceCanonical name: SHORTEST COMMON SUPERSEQUENCE
Reference: Garey & Johnson, Computers and Intractability, A4 SR8
Mathematical definition:
INSTANCE: Finite alphabet Sigma, finite set R of strings from Sigma*, and a positive integer K.
QUESTION: Is there a string w in Sigma* with |w| <= K such that each string x in R is a subsequence of w, i.e., x can be obtained from w by deleting zero or more characters (not necessarily contiguous)?
Variables
dims()specification:vec![alphabet_size; bound]wherealphabet_size= |Sigma| andbound= K. Each of the K positions can take any alphabet symbol.Schema (data type)
Type name:
ShortestCommonSupersequenceVariants: none
alphabet_sizeusizestringsVec<Vec<usize>>boundusizeSize fields (getter methods for overhead expressions and
declare_variants!):alphabet_size()— returns |Sigma|num_strings()— returns |R|bound()— returns Ktotal_length()— returns sum of |x| for x in RNotes:
Metric = bool, implementingSatisfactionProblem.Complexity
declare_variants!complexity string:"alphabet_size ^ bound"(brute-force enumeration of all candidate strings).Extra Remark
Full book text:
INSTANCE: Finite alphabet Sigma, finite set R of strings from Sigma*, and a positive integer K.
QUESTION: Is there a string w in Sigma* with |w| <= K such that each string x in R is a subsequence of w, i.e., w = w0x1w1x2w2 ... xkwk where each wi in Sigma* and x = x1x2 ... xk?
Reference: [Maier, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if |Sigma| = 5. Solvable in polynomial time if |R| = 2 (by first computing the largest common subsequence) or if all x in R have |x| <= 2.
How to solve
Example Instance
Instance 1 (YES, three strings over ternary alphabet):
find_all_satisfyingtest)Instance 2 (YES, six strings over binary alphabet):
Instance 3 (NO):
find_all_satisfyingempty test)