fix(venue-merge): require name similarity on address-based clusters to prevent merging different venues at shared addresses #282
Merged
Address-based venue clustering in CheckMergeDuplicateVenuesCommand previously grouped every term sharing a normalized address+city into one cluster, ignoring names. At multi-tenant addresses (Oscars venue + bowling alley sharing 6801 Hollywood Blvd, taco shop + art space at 675 Pulaski St Athens, etc.) this produces destructive false-positive merges of unrelated venues.

Add VenueMergeHelper::names_are_similar(), a three-rule guard that accepts an address cluster only if some name-level signal agrees:

Rule 1: exact match after normalize_venue_name_for_matching()
Rule 2: substring containment with a 4-char floor on the shorter name
Rule 3: Jaccard token overlap >= 0.70 after stop-word removal

If all three rules fail, the pair is treated as distinct venues even when they share a street address.

Clustering uses complete-linkage (every pair within a sub-cluster must agree) to prevent transitive chaining where A~B and B~C but A and C are not similar.

The name-cluster path is unchanged: normalized-name equality already implements Rule 1.

Fixes #281.

Refs production false-positive pairs from the issue body:
- Dolby Theatre vs Lucky Strike Hollywood (6801 Hollywood Blvd)
- Pancho's Tacos & Tequila vs ATHICA (675 Pulaski St)
- North Charleston Coliseum vs NC Performing Arts Center
- V Theater vs Saxe Theater at Planet Hollywood (token overlap 0.60)
- Come and Take It Live vs Emo's Austin
- The Arrow Room vs Haven City Market
Summary
Fixes #281.
`CheckMergeDuplicateVenuesCommand` (shipped in #278 / v0.37.2) clusters venue terms by name OR by address+city. The address-clustering path ignores names entirely, so multi-tenant buildings, where two unrelated venues legitimately share a street address, get merged into a single cluster.

Today's first production dry-run flagged 18 clusters; 6 of them are false positives that would have caused destructive cross-venue merges if anyone ran `--apply` blind.

This PR adds a three-rule name-similarity guard on the address-cluster path. Two terms sharing an address+city stay clustered only if their names also agree by at least one rule. The name-cluster path is unchanged (it already requires normalized-name equality, which IS Rule 1).
The six production false-positive pairs (now rejected)
The three similarity rules
A pair is accepted as "similar" if any rule passes:
1. Exact normalized match. After `Venue_Taxonomy::normalize_venue_name_for_matching()`, the two names are byte-equal. Handles trivial case/punctuation variants (e.g. `Hi-Fi Indianapolis` vs `HI-FI Indianapolis`).
2. Substring containment. One normalized name contains the other, with a 4-character floor on the shorter name. Handles suffix variants such as `The Abbey` vs `The Abbey-Orlando`.
3. Jaccard token overlap. After removing stop words (`the, a, an, and, of, at`), require intersection/union >= 0.70. Handles `Bowery Ballroom NYC` vs `NYC Bowery Ballroom` (token reorder, 3/3 = 1.0). Correctly rejects `V Theater at Planet Hollywood` vs `Saxe Theater at Planet Hollywood` (3/5 = 0.60, below threshold).

If all three fail, the pair is dissimilar and must not be clustered even when addresses match.
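To make the rule order concrete, here is a minimal PHP sketch of a three-rule check under stated assumptions. It is not the code from this PR: every `sketch_*` name is hypothetical, `sketch_normalize()` is only a guess at what `normalize_venue_name_for_matching()` does (lowercase, drop apostrophes, turn other punctuation into spaces), and the short-name rejection is inferred from the empty / single-char test description.

```php
<?php
// Illustrative sketch only; NOT the plugin's VenueMergeHelper implementation.

const SKETCH_OVERLAP_THRESHOLD = 0.70; // Rule 3 threshold quoted in this PR
const SKETCH_MIN_SUBSTRING_LEN = 4;    // Rule 2 floor on the shorter name
const SKETCH_STOP_WORDS = ['the', 'a', 'an', 'and', 'of', 'at'];

// Hypothetical stand-in for normalize_venue_name_for_matching().
// Assumption: ASCII-only handling is enough for this demo.
function sketch_normalize(string $name): string {
    $name = strtolower($name);
    $name = str_replace("'", '', $name);              // "Joe's" -> "joes"
    $name = preg_replace('/[^a-z0-9]+/', ' ', $name); // punctuation -> space
    return trim($name);
}

// Split a normalized name into unique tokens with stop words removed.
function sketch_tokens(string $normalized): array {
    $tokens = preg_split('/\s+/', $normalized, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_diff($tokens, SKETCH_STOP_WORDS)));
}

function sketch_names_are_similar(string $a, string $b): bool {
    $a = sketch_normalize($a);
    $b = sketch_normalize($b);

    // Assumption: empty, whitespace-only, or single-char names never match.
    if (strlen($a) < 2 || strlen($b) < 2) {
        return false;
    }

    // Rule 1: exact match after normalization.
    if ($a === $b) {
        return true;
    }

    // Rule 2: the shorter name (at least 4 chars) is contained in the longer.
    [$short, $long] = strlen($a) <= strlen($b) ? [$a, $b] : [$b, $a];
    if (strlen($short) >= SKETCH_MIN_SUBSTRING_LEN && str_contains($long, $short)) {
        return true;
    }

    // Rule 3: Jaccard overlap = |intersection| / |union| of the token sets.
    $ta = sketch_tokens($a);
    $tb = sketch_tokens($b);
    $union = count(array_unique(array_merge($ta, $tb)));
    if ($union === 0) {
        return false;
    }
    $intersection = count(array_intersect($ta, $tb));
    return ($intersection / $union) >= SKETCH_OVERLAP_THRESHOLD;
}

var_dump(sketch_names_are_similar('Hi-Fi Indianapolis', 'HI-FI Indianapolis'));   // bool(true), Rule 1
var_dump(sketch_names_are_similar('The Abbey', 'The Abbey-Orlando'));             // bool(true), Rule 2
var_dump(sketch_names_are_similar('Bowery Ballroom NYC', 'NYC Bowery Ballroom')); // bool(true), Rule 3
var_dump(sketch_names_are_similar('V Theater at Planet Hollywood', 'Saxe Theater at Planet Hollywood')); // bool(false), 3/5
var_dump(sketch_names_are_similar('Dolby Theatre', 'Lucky Strike Hollywood'));    // bool(false)
```

Checking the cheap exact and substring rules first means the token sets are only built for pairs that have not already matched. The real normalizer may treat punctuation differently, so results for edge cases could differ from this sketch.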
The 0.70 threshold is intentionally strict: stricter is safer for a destructive merge operation. The issue body and code constant `NAME_SIMILARITY_TOKEN_OVERLAP_THRESHOLD` document this rationale.

Address-cluster splitting
When an address bucket holds 3+ terms with mixed similarity (e.g. two case-variants of "Hi-Fi Indianapolis" plus an unrelated "Bowling Alley" at the same address), the bucket is subdivided by complete-linkage clustering: a term joins a sub-cluster only if it is name-similar to every existing member, not just one. This prevents transitive chaining where A ~ B and B ~ C but A and C are not similar, which matters for a destructive merge.

Surviving sub-clusters are reported with a disambiguating `#N` suffix in their `cluster_key` so the operator can tell sibling sub-clusters apart in the dry-run output.

Test cases
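As a worked example of the mixed-bucket scenario these tests cover, the complete-linkage split described above could be sketched as follows. This is an assumption-laden illustration, not the command's code: `sketch_split_bucket()` and the stand-in similarity callback are hypothetical, the greedy first-fit order is one possible choice, and dropping single-member groups is inferred from the expectation that dissimilar terms produce no address cluster.

```php
<?php
// Illustrative sketch only; function and variable names are hypothetical.

/**
 * Split one address bucket into sub-clusters using complete linkage.
 *
 * @param string[] $names      Venue names that share one address+city.
 * @param callable $is_similar fn(string, string): bool pairwise name check.
 * @return string[][]          Sub-clusters that have at least two members.
 */
function sketch_split_bucket(array $names, callable $is_similar): array {
    $clusters = [];

    foreach ($names as $name) {
        $placed = false;

        foreach ($clusters as $i => $members) {
            // Complete linkage: the candidate must match EVERY member.
            $fits_all = true;
            foreach ($members as $member) {
                if (!$is_similar($name, $member)) {
                    $fits_all = false;
                    break;
                }
            }
            if ($fits_all) {
                $clusters[$i][] = $name;
                $placed = true;
                break;
            }
        }

        if (!$placed) {
            $clusters[] = [$name]; // start a new candidate sub-cluster
        }
    }

    // Assumption: a lone term is not a merge candidate, so it is dropped.
    return array_values(array_filter(
        $clusters,
        fn(array $c): bool => count($c) >= 2
    ));
}

// Stand-in similarity for the demo: case-insensitive equality only.
$same_ignoring_case = fn(string $a, string $b): bool => strcasecmp($a, $b) === 0;

$bucket = ['Hi-Fi Indianapolis', 'Bowling Alley', 'HI-FI Indianapolis'];
print_r(sketch_split_bucket($bucket, $same_ignoring_case));
```

With this bucket the sketch yields a single sub-cluster holding the two Hi-Fi variants, and "Bowling Alley" is left out. Because placement is greedy, the result can depend on input order when similarity is not transitive; the real command may resolve that differently.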
All in `tests/Unit/VenueMergeHelperTest.php`. The three-rule unit tests and the production fixtures were also verified standalone (PHP 8.4): 16/16 pass.

`names_are_similar()` unit tests

- `test_names_are_similar_exact_normalized_match`: case-only variant → Rule 1.
- `test_names_are_similar_substring_containment`: `The Abbey` vs `The Abbey-Orlando` → Rule 2.
- `test_names_are_similar_token_overlap_above_threshold`: token reorder `Bowery Ballroom NYC` ↔ `NYC Bowery Ballroom` → Rule 3 (1.0).
- `test_names_are_similar_token_overlap_below_threshold`: `V Theater at Planet Hollywood` vs `Saxe Theater at Planet Hollywood` → 0.60, rejected.
- `test_names_are_similar_completely_different`: `Dolby Theatre` vs `Lucky Strike Hollywood` → no overlap.
- `test_names_are_similar_taco_vs_art_space`: `Pancho's Tacos & Tequila` vs `ATHICA` → no overlap.
- `test_names_are_similar_short_substring_at_threshold`: documents the 4-char floor edge case (`Joes` vs `Joe's Bar and Grill` passes Rule 2; impact bounded because both must already share address+city).
- `test_names_are_similar_empty_or_whitespace`: empty / whitespace / single char → false.

Regression fixtures
- `test_production_false_positive_pairs_rejected`: all six pairs from the issue body return false.
- `test_production_true_positive_pairs_accepted`: `Hi-Fi Indianapolis` case-variant, `The Abbey` suffix variant, `Hook & Ladder` ampersand, `St Augustine Amphitheatre` "The" stripping all return true.

Address-cluster integration tests
- `test_address_cluster_excludes_dissimilar_names`: two dissimilar terms at the same address produce no `addr:` cluster.
- `test_address_cluster_includes_similar_names`: two similar-named terms at the same address are clustered.
- `test_address_cluster_splits_multi_tenant_with_mixed_pair`: three terms at one address (two similar + one unrelated) produce a 2-term sub-cluster excluding the intruder.
- `test_name_cluster_unchanged`: regression guard; name-only clusters still work.

Test runner status
## Pre-existing test bootstrap blocker

`homeboy test` fails on this branch AND on plain `main` with the same error:

That is an upstream `data-machine/agents-api` plugin dependency missing from the test playground, not anything this PR introduces. Verified by stashing the diff and re-running on a clean `main`: identical failure.

To compensate, the three-rule logic was exercised standalone against a minimal harness loading only `VenueMergeHelper` + a copy of `normalize_venue_name_for_matching()`. All 16 fixture cases pass (every unit test in this PR plus all 6 production false-positives plus all 4 production true-positives).

`homeboy lint --changed-since main` reports 9 findings inside the two touched source files, all on pre-existing lines I did not modify. None of my added lines introduce new lint findings.

Verification after merge + deploy
Re-run on extrachill.com:
Expected: cluster count drops from 18 to ~11 (the 6 multi-tenant false positives disappear; legitimate name-collision and suite-variant clusters survive).
Out of scope