Motivation
CLUSTERING (MS9) from Garey & Johnson, A1.2. Given a finite set of elements with pairwise integer distances, determine whether the elements can be partitioned into at most K clusters such that all intra-cluster pairwise distances are at most B. NP-complete even for K=3 and distances restricted to {0,1} (Brucker 1978). Polynomial for K=2 via a graph threshold approach. Applications in data mining, pattern recognition, bioinformatics, and unsupervised learning.
Associated rules:
Definition
Name: Clustering
Reference: Garey & Johnson, Computers and Intractability, A1.2 MS9; Brucker 1978
Mathematical definition:
INSTANCE: Finite set X, nonneg integer distance d(x,y) for each pair x,y in X with d(x,y) = d(y,x) and d(x,x) = 0, positive integers K and B.
QUESTION: Can X be partitioned into k <= K disjoint sets X_1, X_2, ..., X_k such that for each i, for every pair x,y in X_i, d(x,y) <= B?
Variables
- Count: |X| (one per element)
- Per-variable domain: {0, 1, ..., K-1}
- Meaning: The index of the cluster to which each element is assigned. Every cluster must satisfy the diameter constraint: all pairwise distances within the cluster are at most B.
Schema (data type)
Type name: Clustering
Variants: none (unique input structure)
| Field |
Type |
Description |
distances |
Vec<Vec<u64>> |
Symmetric distance matrix d(x,y) for elements in X |
num_clusters |
usize |
Maximum number of clusters K |
diameter_bound |
u64 |
Maximum allowed intra-cluster distance B |
Notes:
- This is a feasibility (decision) problem:
Value = Or.
- Key getter methods:
num_elements() (= |X|), num_clusters() (= K), diameter_bound() (= B).
Complexity
- Decision complexity: NP-complete for any fixed K >= 3 with distances in {0,1} (Brucker 1978; transformation from Graph 3-Colorability). Polynomial for K = 2.
- Best known exact algorithm: O*(K^n) by brute-force enumeration of all K-partitions. The actual number of distinct partitions is given by Stirling numbers of the second kind S(n, K), which is bounded above by K^n / K!. No significantly better exact algorithm is known for general K.
- Concrete complexity expression:
"num_clusters^num_elements" (for declare_variants!)
- Special cases:
- K = 1: feasible iff all pairwise distances are at most B (checkable in O(n^2)).
- K = 2: polynomial — construct the "threshold graph" with edges for pairs with distance > B, then check if it is bipartite.
- K >= |X|: trivially YES (each element in its own singleton cluster).
- References:
- P. Brucker (1978). "On the complexity of clustering problems." In Optimization and Operations Research, LNEMS vol. 157, pp. 45-54, Springer.
- M.R. Garey and D.S. Johnson (1979). Computers and Intractability. W.H. Freeman, pp. 226 (MS9).
Extra Remark
Full book text:
INSTANCE: Finite set X, "distance" d(x,y) in Z+ for each pair x,y in X, positive integers K and B.
QUESTION: Can X be partitioned into k <= K disjoint sets X_1, X_2, ..., X_k such that, for 1 <= i <= k, d(x,y) <= B for all x,y in X_i?
Reference: [Brucker, 1978]. Transformation from GRAPH 3-COLORABILITY.
Comment: NP-complete even for K = 3 and d restricted to {0,1}. Polynomial for K = 2.
Note on distance domain: The GJ text states d(x,y) in Z+ (positive integers), but the standard formulation requires d(x,x) = 0, which is not in Z+. The schema uses u64 (non-negative integers, Z>=0), which is the correct domain for a pseudometric.
How to solve
Note: Bruteforce enumerates all K-colorings (partitions) of elements and checks the diameter constraint for each cluster.
Example Instance
Input:
6 elements X = {0, 1, 2, 3, 4, 5} with distance matrix:
|
0 |
1 |
2 |
3 |
4 |
5 |
| 0 |
0 |
1 |
1 |
3 |
3 |
3 |
| 1 |
1 |
0 |
1 |
3 |
3 |
3 |
| 2 |
1 |
1 |
0 |
3 |
3 |
3 |
| 3 |
3 |
3 |
3 |
0 |
1 |
1 |
| 4 |
3 |
3 |
3 |
1 |
0 |
1 |
| 5 |
3 |
3 |
3 |
1 |
1 |
0 |
Two natural groups: {0,1,2} with intra-distances 1, and {3,4,5} with intra-distances 1. Cross-group distances are all 3.
K = 2, B = 1 (feasible):
Partition: X_1 = {0, 1, 2}, X_2 = {3, 4, 5}.
Verification: max intra-distance in X_1 = d(0,1) = d(0,2) = d(1,2) = 1 <= 1. Max intra-distance in X_2 = d(3,4) = d(3,5) = d(4,5) = 1 <= 1.
Answer: YES.
K = 1, B = 1 (infeasible):
All 6 elements in one cluster. d(0,3) = 3 > 1. No single cluster can contain both groups.
Answer: NO.
K = 2, B = 0 (infeasible):
Within each group, distances are 1 > 0. Even putting {0,1,2} together violates B = 0. We would need K >= 6 (all singletons) to satisfy B = 0.
Answer: NO.
Expected outcome for primary instance (K=2, B=1):
Answer: YES. Partition: X_1 = {0, 1, 2}, X_2 = {3, 4, 5}. This is the unique valid 2-clustering up to relabeling.
Python validation script
from itertools import product
def check_clustering(distances, num_clusters, diameter_bound):
"""Brute-force check if a valid clustering exists."""
n = len(distances)
for assignment in product(range(num_clusters), repeat=n):
valid = True
for c in range(num_clusters):
members = [i for i in range(n) if assignment[i] == c]
for i in range(len(members)):
for j in range(i + 1, len(members)):
if distances[members[i]][members[j]] > diameter_bound:
valid = False
break
if not valid:
break
if not valid:
break
if valid:
return True, assignment
return False, None
distances = [
[0, 1, 1, 3, 3, 3],
[1, 0, 1, 3, 3, 3],
[1, 1, 0, 3, 3, 3],
[3, 3, 3, 0, 1, 1],
[3, 3, 3, 1, 0, 1],
[3, 3, 3, 1, 1, 0],
]
# K=2, B=1 → YES
result, assignment = check_clustering(distances, 2, 1)
assert result == True, "K=2, B=1 should be YES"
clusters = {}
for i, c in enumerate(assignment):
clusters.setdefault(c, []).append(i)
print(f"K=2, B=1: YES, partition = {clusters}")
# K=1, B=1 → NO
result, _ = check_clustering(distances, 1, 1)
assert result == False, "K=1, B=1 should be NO"
print("K=1, B=1: NO (correct)")
# K=2, B=0 → NO
result, _ = check_clustering(distances, 2, 0)
assert result == False, "K=2, B=0 should be NO"
print("K=2, B=0: NO (correct)")
# K=1, B=3 → YES (all in one cluster)
result, _ = check_clustering(distances, 1, 3)
assert result == True, "K=1, B=3 should be YES"
print("K=1, B=3: YES (correct)")
print("All assertions passed!")
Motivation
CLUSTERING (MS9) from Garey & Johnson, A1.2. Given a finite set of elements with pairwise integer distances, determine whether the elements can be partitioned into at most K clusters such that all intra-cluster pairwise distances are at most B. NP-complete even for K=3 and distances restricted to {0,1} (Brucker 1978). Polynomial for K=2 via a graph threshold approach. Applications in data mining, pattern recognition, bioinformatics, and unsupervised learning.
Associated rules:
Definition
Name:
ClusteringReference: Garey & Johnson, Computers and Intractability, A1.2 MS9; Brucker 1978
Mathematical definition:
INSTANCE: Finite set X, nonneg integer distance d(x,y) for each pair x,y in X with d(x,y) = d(y,x) and d(x,x) = 0, positive integers K and B.
QUESTION: Can X be partitioned into k <= K disjoint sets X_1, X_2, ..., X_k such that for each i, for every pair x,y in X_i, d(x,y) <= B?
Variables
Schema (data type)
Type name:
ClusteringVariants: none (unique input structure)
distancesVec<Vec<u64>>num_clustersusizediameter_boundu64Notes:
Value = Or.num_elements()(= |X|),num_clusters()(= K),diameter_bound()(= B).Complexity
"num_clusters^num_elements"(fordeclare_variants!)Extra Remark
Full book text:
INSTANCE: Finite set X, "distance" d(x,y) in Z+ for each pair x,y in X, positive integers K and B.
QUESTION: Can X be partitioned into k <= K disjoint sets X_1, X_2, ..., X_k such that, for 1 <= i <= k, d(x,y) <= B for all x,y in X_i?
Reference: [Brucker, 1978]. Transformation from GRAPH 3-COLORABILITY.
Comment: NP-complete even for K = 3 and d restricted to {0,1}. Polynomial for K = 2.
Note on distance domain: The GJ text states d(x,y) in Z+ (positive integers), but the standard formulation requires d(x,x) = 0, which is not in Z+. The schema uses
u64(non-negative integers, Z>=0), which is the correct domain for a pseudometric.How to solve
Note: Bruteforce enumerates all K-colorings (partitions) of elements and checks the diameter constraint for each cluster.
Example Instance
Input:
6 elements X = {0, 1, 2, 3, 4, 5} with distance matrix:
Two natural groups: {0,1,2} with intra-distances 1, and {3,4,5} with intra-distances 1. Cross-group distances are all 3.
K = 2, B = 1 (feasible):
Partition: X_1 = {0, 1, 2}, X_2 = {3, 4, 5}.
Verification: max intra-distance in X_1 = d(0,1) = d(0,2) = d(1,2) = 1 <= 1. Max intra-distance in X_2 = d(3,4) = d(3,5) = d(4,5) = 1 <= 1.
Answer: YES.
K = 1, B = 1 (infeasible):
All 6 elements in one cluster. d(0,3) = 3 > 1. No single cluster can contain both groups.
Answer: NO.
K = 2, B = 0 (infeasible):
Within each group, distances are 1 > 0. Even putting {0,1,2} together violates B = 0. We would need K >= 6 (all singletons) to satisfy B = 0.
Answer: NO.
Expected outcome for primary instance (K=2, B=1):
Answer: YES. Partition: X_1 = {0, 1, 2}, X_2 = {3, 4, 5}. This is the unique valid 2-clustering up to relabeling.
Python validation script