
fix(venue-merge): require name similarity on address-based clusters to prevent merging different venues at shared addresses #282

Merged: chubes4 merged 1 commit into main from fix-venue-merge-name-similarity on May 18, 2026

chubes4 (Member) commented May 18, 2026

Summary

Fixes #281.

CheckMergeDuplicateVenuesCommand (shipped in #278 / v0.37.2) clusters venue terms by name OR by address+city. The address-clustering path ignores names entirely, so multi-tenant buildings — where two unrelated venues legitimately share a street address — get merged into a single cluster.

Today's first production dry-run flagged 18 clusters; 6 of them are false positives that would have caused destructive cross-venue merges if anyone ran --apply blind.

This PR adds a three-rule name-similarity guard on the address-cluster path. Two terms sharing an address+city stay clustered only if their names also agree by at least one rule. The name-cluster path is unchanged (it already requires normalized-name equality, which IS Rule 1).

The six production false-positive pairs (now rejected)

| Address | Pair |
| --- | --- |
| 6801 Hollywood Blvd | Dolby Theatre vs Lucky Strike Hollywood (Oscars venue vs bowling alley) |
| 2015 E Riverside Dr, Austin | Come and Take It Live vs Emo's Austin |
| 675 Pulaski St, Athens | Pancho's Tacos & Tequila vs ATHICA (taco shop vs art space) |
| 5001 Coliseum Dr, North Charleston | North Charleston Coliseum vs NC Performing Arts Center |
| 3663 Las Vegas Blvd S | V Theater vs Saxe Theater at Planet Hollywood |
| 8443 Haven Ave, Rancho Cucamonga | The Arrow Room vs Haven City Market |

The three similarity rules

A pair is accepted as "similar" if any rule passes:

  1. Exact normalized name match — after Venue_Taxonomy::normalize_venue_name_for_matching(), the two names are byte-equal. Handles trivial case/punctuation variants (e.g. Hi-Fi Indianapolis vs HI-FI Indianapolis).
  2. Substring containment — one normalized name contains the other, and the shorter normalized name is at least 4 characters. Handles The Abbey vs The Abbey-Orlando.
  3. Jaccard token overlap >= 0.70 — tokenize on whitespace, strip stop-words (the, a, an, and, of, at), require intersection/union >= 0.70. Handles Bowery Ballroom NYC vs NYC Bowery Ballroom (token reorder, 3/3 = 1.0). Correctly rejects V Theater at Planet Hollywood vs Saxe Theater at Planet Hollywood (3/5 = 0.60, below threshold).

If all three fail, the pair is dissimilar and must not be clustered even when addresses match.
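As a worked illustration of Rule 3, the Jaccard score can be written out for the two examples above. The token sets assume the normalizer lowercases names, which this description does not state explicitly.

```math
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\\[6pt]
A = \{\text{v}, \text{theater}, \text{planet}, \text{hollywood}\}, \quad
B = \{\text{saxe}, \text{theater}, \text{planet}, \text{hollywood}\}
\\[6pt]
J(A, B) = \frac{3}{5} = 0.60 < 0.70 \quad \text{(rejected)}
\\[6pt]
A = B = \{\text{bowery}, \text{ballroom}, \text{nyc}\}
\;\Rightarrow\; J(A, B) = \frac{3}{3} = 1.0 \ge 0.70 \quad \text{(accepted)}
```

The stop-word "at" is removed before counting, which is why each Planet Hollywood name contributes four tokens rather than five.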

The 0.70 threshold is intentionally strict — stricter is safer for a destructive merge operation. The issue body and code constant NAME_SIMILARITY_TOKEN_OVERLAP_THRESHOLD document this rationale.
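The three rules can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every `sketch_*` function is a hypothetical name, `sketch_normalize()` is only a stand-in for `Venue_Taxonomy::normalize_venue_name_for_matching()` (whose real behavior may differ), and the length guard for empty or single-character names is an assumption based on the test description.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for Venue_Taxonomy::normalize_venue_name_for_matching().
// Assumption: lowercase, drop apostrophes, turn other punctuation into spaces.
function sketch_normalize(string $name): string {
    $name = strtolower($name);
    $name = str_replace(["'", "’"], '', $name);                  // "Joe's" -> "joes"
    return trim((string) preg_replace('/[^a-z0-9]+/', ' ', $name));
}

// Whitespace tokens minus the stop-words listed in Rule 3.
function sketch_tokens(string $normalized): array {
    $stop   = ['the', 'a', 'an', 'and', 'of', 'at'];
    $tokens = preg_split('/\s+/', $normalized, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_diff($tokens, $stop)));
}

// Jaccard index: |intersection| / |union| of two token sets.
function sketch_jaccard(array $a, array $b): float {
    $union = count(array_unique(array_merge($a, $b)));
    return $union === 0 ? 0.0 : count(array_intersect($a, $b)) / $union;
}

function sketch_names_are_similar(string $x, string $y): bool {
    $a = sketch_normalize($x);
    $b = sketch_normalize($y);

    // Assumed guard: empty, whitespace-only, or single-character names never match.
    if (strlen($a) < 2 || strlen($b) < 2) {
        return false;
    }
    // Rule 1: exact normalized match.
    if ($a === $b) {
        return true;
    }
    // Rule 2: substring containment, shorter name at least 4 characters.
    [$short, $long] = strlen($a) <= strlen($b) ? [$a, $b] : [$b, $a];
    if (strlen($short) >= 4 && str_contains($long, $short)) {
        return true;
    }
    // Rule 3: Jaccard token overlap >= 0.70.
    return sketch_jaccard(sketch_tokens($a), sketch_tokens($b)) >= 0.70;
}

var_dump(sketch_names_are_similar('Hi-Fi Indianapolis', 'HI-FI Indianapolis'));   // bool(true), Rule 1
var_dump(sketch_names_are_similar('The Abbey', 'The Abbey-Orlando'));             // bool(true), Rule 2
var_dump(sketch_names_are_similar('Bowery Ballroom NYC', 'NYC Bowery Ballroom')); // bool(true), Rule 3
var_dump(sketch_names_are_similar('Dolby Theatre', 'Lucky Strike Hollywood'));    // bool(false)
```

Ordering the rules from cheapest to most expensive keeps the common case (identical normalized names) to a single string comparison. `str_contains()` requires PHP 8.0 or later.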

Address-cluster splitting

When an address bucket holds 3+ terms with mixed similarity (e.g. two case-variants of "Hi-Fi Indianapolis" plus an unrelated "Bowling Alley" at the same address), the bucket is subdivided by complete-linkage clustering: a term joins a sub-cluster only if it is name-similar to every existing member, not just one. This prevents transitive chaining where A~B and B~C but A and C are not similar, which matters for a destructive merge.

Surviving sub-clusters are reported with a disambiguating #N suffix in their cluster_key so the operator can tell sibling sub-clusters apart in the dry-run output.
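The splitting step can be sketched as a greedy complete-linkage pass over one address bucket. This is an illustrative sketch only: `sketch_split_bucket()` and the toy predicate are hypothetical, the greedy iteration order is an assumption, and the `#N` suffixing of `cluster_key` is not modeled.

```php
<?php
declare(strict_types=1);

/**
 * Split one address bucket into sub-clusters by complete linkage.
 *
 * @param string[] $names   Venue names sharing an address+city.
 * @param callable $similar function (string, string): bool, e.g. the three-rule guard.
 * @return string[][]       Only sub-clusters with two or more members.
 */
function sketch_split_bucket(array $names, callable $similar): array {
    $clusters = [];
    foreach ($names as $name) {
        $placed = false;
        foreach ($clusters as $i => $members) {
            // Complete linkage: the candidate must match EVERY existing member.
            $fits = true;
            foreach ($members as $member) {
                if (!$similar($name, $member)) {
                    $fits = false;
                    break;
                }
            }
            if ($fits) {
                $clusters[$i][] = $name;
                $placed = true;
                break;
            }
        }
        if (!$placed) {
            $clusters[] = [$name]; // start a new sub-cluster
        }
    }
    // Singletons are not merge candidates, so drop them.
    return array_values(array_filter($clusters, fn (array $c): bool => count($c) >= 2));
}

// Toy predicate for the demo only: case-insensitive equality.
$similar = fn (string $a, string $b): bool => strtolower($a) === strtolower($b);

$bucket = ['Hi-Fi Indianapolis', 'Bowling Alley', 'HI-FI Indianapolis'];
print_r(sketch_split_bucket($bucket, $similar));
// One sub-cluster: ['Hi-Fi Indianapolis', 'HI-FI Indianapolis']; 'Bowling Alley' is excluded.
```

With a single-linkage rule (match any one member), a chain A~B, B~C would pull A and C into the same merge even though they were never compared favorably; requiring agreement with every member avoids that at the cost of possibly leaving some legitimate duplicates unmerged.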

Test cases

All in tests/Unit/VenueMergeHelperTest.php. The three-rule unit tests and the production fixtures were also verified standalone (PHP 8.4) — 16/16 pass.

names_are_similar() unit tests

  • test_names_are_similar_exact_normalized_match: case-only variant → Rule 1.
  • test_names_are_similar_substring_containment: The Abbey vs The Abbey-Orlando → Rule 2.
  • test_names_are_similar_token_overlap_above_threshold: token reorder, Bowery Ballroom NYC vs NYC Bowery Ballroom → Rule 3 (1.0).
  • test_names_are_similar_token_overlap_below_threshold: V Theater at Planet Hollywood vs Saxe Theater at Planet Hollywood → 0.60, rejected.
  • test_names_are_similar_completely_different: Dolby Theatre vs Lucky Strike Hollywood → no overlap.
  • test_names_are_similar_taco_vs_art_space: Pancho's Tacos & Tequila vs ATHICA → no overlap.
  • test_names_are_similar_short_substring_at_threshold: documents the 4-char floor edge case (Joes vs Joe's Bar and Grill passes Rule 2; impact bounded because both must already share address+city).
  • test_names_are_similar_empty_or_whitespace: empty / whitespace / single char → false.

Regression fixtures

  • test_production_false_positive_pairs_rejected: all six pairs from the issue body return false.
  • test_production_true_positive_pairs_accepted: Hi-Fi Indianapolis case-variant, The Abbey suffix variant, Hook & Ladder ampersand, and St Augustine Amphitheatre "The" stripping all return true.

Address-cluster integration tests

  • test_address_cluster_excludes_dissimilar_names — two dissimilar terms at the same address produce no addr: cluster.
  • test_address_cluster_includes_similar_names — two similar-named terms at the same address are clustered.
  • test_address_cluster_splits_multi_tenant_with_mixed_pair — three terms at one address (two similar + one unrelated) produces a 2-term sub-cluster excluding the intruder.
  • test_name_cluster_unchanged — regression guard: name-only clusters still work.

Test runner status

Pre-existing test bootstrap blocker

homeboy test fails on this branch AND on plain main with the same error:

BOOTSTRAP FAILURE: load_deps:Error: Interface "WP_Agent_Token_Store" not found
at /wordpress/wp-content/plugins/data-machine/inc/Core/Database/Agents/AgentTokens.php:24

That is an upstream data-machine / agents-api plugin dependency missing from the test playground, not anything this PR introduces. Verified by stashing the diff and re-running on a clean main — identical failure.

To compensate, the three-rule logic was exercised standalone against a minimal harness loading only VenueMergeHelper + a copy of normalize_venue_name_for_matching(). All 16 fixture cases pass (every unit test in this PR plus all 6 production false-positives plus all 4 production true-positives).

homeboy lint --changed-since main reports 9 findings inside the two touched source files, all on pre-existing lines I did not modify. None of my added lines introduce new lint findings.

Verification after merge + deploy

Re-run on extrachill.com:

wp data-machine-events check merge-duplicate-venues --dry-run

Expected: cluster count drops from 18 to ~12 (the 6 multi-tenant false positives disappear; legitimate name-collision and suite-variant clusters survive).

Out of scope

  • Levenshtein-distance fuzzy matching beyond the three documented rules — adds false-positive risk on a destructive operation.
  • Lowering the 0.70 Jaccard threshold to capture more matches — stricter is safer.
  • Touching the name-cluster path — it already implements Rule 1 correctly.

Address-based venue clustering in CheckMergeDuplicateVenuesCommand
previously grouped every term sharing a normalized address+city into
one cluster, ignoring names. At multi-tenant addresses (Oscars venue
+ bowling alley sharing 6801 Hollywood Blvd, taco shop + art space at
675 Pulaski St Athens, etc.) this produces destructive false-positive
merges of unrelated venues.

Add VenueMergeHelper::names_are_similar() — a three-rule guard that
accepts an address cluster only if some name-level signal agrees:

  Rule 1: exact match after normalize_venue_name_for_matching()
  Rule 2: substring containment with a 4-char floor on the shorter name
  Rule 3: Jaccard token overlap >= 0.70 after stop-word removal

If all three rules fail, the pair is treated as distinct venues even
when they share a street address. Clustering uses complete-linkage
(every pair within a sub-cluster must agree) to prevent transitive
chaining where A~B and B~C but A and C are not similar.

The name-cluster path is unchanged: normalized-name equality already
implements Rule 1.

Fixes #281.

Refs production false-positive pairs from the issue body:
  - Dolby Theatre vs Lucky Strike Hollywood (6801 Hollywood Blvd)
  - Pancho's Tacos & Tequila vs ATHICA (675 Pulaski St)
  - North Charleston Coliseum vs NC Performing Arts Center
  - V Theater vs Saxe Theater at Planet Hollywood (token overlap 0.60)
  - Come and Take It Live vs Emo's Austin
  - The Arrow Room vs Haven City Market

homeboy-ci (Bot, Contributor) commented May 18, 2026

Homeboy Results — data-machine-events

Audit

audit — passed

  • intra-method-duplication — 3 finding(s)
  • DuplicateDetection — 1 finding(s)
  • test_coverage — 1 finding(s)
  • Total: 5 finding(s)

Deep dive: homeboy audit data-machine-events --changed-since 20b2466

Tooling versions
  • Homeboy CLI: homeboy 0.182.0+6737bba
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: dd47f26a
  • Action: Extra-Chill/homeboy-action@v2

chubes4 merged commit 02e6f1f into main on May 18, 2026
1 check passed

Linked issue: fix(venue-merge): address-based clusters must require name similarity to prevent merging different venues at shared addresses
