Skip to content

[Model] Clustering #927

@isPANN

Description

@isPANN

Motivation

CLUSTERING (MS9) from Garey & Johnson, A1.2. Given a finite set of elements with pairwise integer distances, determine whether the elements can be partitioned into at most K clusters such that all intra-cluster pairwise distances are at most B. NP-complete even for K=3 and distances restricted to {0,1} (Brucker 1978). Polynomial for K=2 via a graph threshold approach. Applications in data mining, pattern recognition, bioinformatics, and unsupervised learning.

Associated rules:

Definition

Name: Clustering
Reference: Garey & Johnson, Computers and Intractability, A1.2 MS9; Brucker 1978

Mathematical definition:

INSTANCE: Finite set X, nonneg integer distance d(x,y) for each pair x,y in X with d(x,y) = d(y,x) and d(x,x) = 0, positive integers K and B.
QUESTION: Can X be partitioned into k <= K disjoint sets X_1, X_2, ..., X_k such that for each i, for every pair x,y in X_i, d(x,y) <= B?

Variables

  • Count: |X| (one per element)
  • Per-variable domain: {0, 1, ..., K-1}
  • Meaning: The index of the cluster to which each element is assigned. Every cluster must satisfy the diameter constraint: all pairwise distances within the cluster are at most B.

Schema (data type)

Type name: Clustering
Variants: none (unique input structure)

Field Type Description
distances Vec<Vec<u64>> Symmetric distance matrix d(x,y) for elements in X
num_clusters usize Maximum number of clusters K
diameter_bound u64 Maximum allowed intra-cluster distance B

Notes:

  • This is a feasibility (decision) problem: Value = Or.
  • Key getter methods: num_elements() (= |X|), num_clusters() (= K), diameter_bound() (= B).

Complexity

  • Decision complexity: NP-complete for any fixed K >= 3 with distances in {0,1} (Brucker 1978; transformation from Graph 3-Colorability). Polynomial for K = 2.
  • Best known exact algorithm: O*(K^n) by brute-force enumeration of all K-partitions. The actual number of distinct partitions is given by Stirling numbers of the second kind S(n, K), which is bounded above by K^n / K!. No significantly better exact algorithm is known for general K.
  • Concrete complexity expression: "num_clusters^num_elements" (for declare_variants!)
  • Special cases:
    • K = 1: feasible iff all pairwise distances are at most B (checkable in O(n^2)).
    • K = 2: polynomial — construct the "threshold graph" with edges for pairs with distance > B, then check if it is bipartite.
    • K >= |X|: trivially YES (each element in its own singleton cluster).
  • References:
    • P. Brucker (1978). "On the complexity of clustering problems." In Optimization and Operations Research, LNEMS vol. 157, pp. 45-54, Springer.
    • M.R. Garey and D.S. Johnson (1979). Computers and Intractability. W.H. Freeman, pp. 226 (MS9).

Extra Remark

Full book text:

INSTANCE: Finite set X, "distance" d(x,y) in Z+ for each pair x,y in X, positive integers K and B.
QUESTION: Can X be partitioned into k <= K disjoint sets X_1, X_2, ..., X_k such that, for 1 <= i <= k, d(x,y) <= B for all x,y in X_i?

Reference: [Brucker, 1978]. Transformation from GRAPH 3-COLORABILITY.
Comment: NP-complete even for K = 3 and d restricted to {0,1}. Polynomial for K = 2.

Note on distance domain: The GJ text states d(x,y) in Z+ (positive integers), but the standard formulation requires d(x,x) = 0, which is not in Z+. The schema uses u64 (non-negative integers, Z>=0), which is the correct domain for a pseudometric.

How to solve

  • It can be solved by (existing) bruteforce.
  • It can be solved by reducing to integer programming.
  • Other:

Note: Bruteforce enumerates all K-colorings (partitions) of elements and checks the diameter constraint for each cluster.

Example Instance

Input:
6 elements X = {0, 1, 2, 3, 4, 5} with distance matrix:

0 1 2 3 4 5
0 0 1 1 3 3 3
1 1 0 1 3 3 3
2 1 1 0 3 3 3
3 3 3 3 0 1 1
4 3 3 3 1 0 1
5 3 3 3 1 1 0

Two natural groups: {0,1,2} with intra-distances 1, and {3,4,5} with intra-distances 1. Cross-group distances are all 3.

K = 2, B = 1 (feasible):
Partition: X_1 = {0, 1, 2}, X_2 = {3, 4, 5}.
Verification: max intra-distance in X_1 = d(0,1) = d(0,2) = d(1,2) = 1 <= 1. Max intra-distance in X_2 = d(3,4) = d(3,5) = d(4,5) = 1 <= 1.
Answer: YES.

K = 1, B = 1 (infeasible):
All 6 elements in one cluster. d(0,3) = 3 > 1. No single cluster can contain both groups.
Answer: NO.

K = 2, B = 0 (infeasible):
Within each group, distances are 1 > 0. Even putting {0,1,2} together violates B = 0. We would need K >= 6 (all singletons) to satisfy B = 0.
Answer: NO.

Expected outcome for primary instance (K=2, B=1):
Answer: YES. Partition: X_1 = {0, 1, 2}, X_2 = {3, 4, 5}. This is the unique valid 2-clustering up to relabeling.

Python validation script
from itertools import product

def check_clustering(distances, num_clusters, diameter_bound):
    """Brute-force check if a valid clustering exists."""
    n = len(distances)
    for assignment in product(range(num_clusters), repeat=n):
        valid = True
        for c in range(num_clusters):
            members = [i for i in range(n) if assignment[i] == c]
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    if distances[members[i]][members[j]] > diameter_bound:
                        valid = False
                        break
                if not valid:
                    break
            if not valid:
                break
        if valid:
            return True, assignment
    return False, None

distances = [
    [0, 1, 1, 3, 3, 3],
    [1, 0, 1, 3, 3, 3],
    [1, 1, 0, 3, 3, 3],
    [3, 3, 3, 0, 1, 1],
    [3, 3, 3, 1, 0, 1],
    [3, 3, 3, 1, 1, 0],
]

# K=2, B=1 → YES
result, assignment = check_clustering(distances, 2, 1)
assert result == True, "K=2, B=1 should be YES"
clusters = {}
for i, c in enumerate(assignment):
    clusters.setdefault(c, []).append(i)
print(f"K=2, B=1: YES, partition = {clusters}")

# K=1, B=1 → NO
result, _ = check_clustering(distances, 1, 1)
assert result == False, "K=1, B=1 should be NO"
print("K=1, B=1: NO (correct)")

# K=2, B=0 → NO
result, _ = check_clustering(distances, 2, 0)
assert result == False, "K=2, B=0 should be NO"
print("K=2, B=0: NO (correct)")

# K=1, B=3 → YES (all in one cluster)
result, _ = check_clustering(distances, 1, 3)
assert result == True, "K=1, B=3 should be YES"
print("K=1, B=3: YES (correct)")

print("All assertions passed!")

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions