fix(venue-dedup): normalize ampersand/HTML-entity/apostrophe in names + strip suite suffixes from addresses + migration CLI #278
Merged
… + strip suite suffixes from addresses

Tightens the two venue-collision normalizers introduced by PR #252 (address-aware venue resolution) so today's 6 production-confirmed duplicate venue-term clusters stop reproducing on every discovery batch.

Name normalization (Venue_Taxonomy::normalize_venue_name_for_matching):
- html_entity_decode is already first; now also converts every '&' variant (spaced, tight, decoded from '&amp;') to literal ' and ' before the alphanumeric strip. 'Hook & Ladder Theater', 'Hook and Ladder Theater', and 'Hook &amp; Ladder Theater' now collapse to the same key.
- Apostrophes were already stripped via the non-alphanumeric pass; documented alongside the ampersand handling so future readers see both collisions covered here.

Address normalization (Venue_Taxonomy::normalize_address_for_matching):
- Strips suite/unit/apt/apartment/room/rm/#NNN suffixes before the existing street-suffix replacements so '3010 Minnehaha Ave STE 420' collapses to '3010 minnehaha ave'. Two passes: word-boundary suffixes plus a dedicated '#NNN' pattern (since '#' is non-word and word-boundary anchors do not catch it).
- Trailing comma/dash from the strip is trimmed so substring comparison stays clean.
- Method is now public so the upcoming migration helper and unit tests can call it directly.

Adds VenueMergeHelper (inc/Core/DuplicateDetection), the term-level analogue of EventMergeHelper. Single source of truth for 'given a winner term and loser term, smart-merge meta + reassign posts + rewrite flow handler_config refs + delete loser', in that exact order. Reassignment is via wp_set_object_terms so tt_count cache stays correct (no raw term_relationships SQL). Flow handler_config rewriting covers both flat ('handler_config.venue') and nested ('handler_config.universal_web_scraper.venue') shapes observed in production. Respects a _venue_no_merge=1 opt-out flag on either side.

Tests (VenueNormalizationTest): ampersand/HTML-entity collapse, apostrophe collapse (Amos Southend, Cliff Bell's, Proctor's Theatre), idempotency for both normalizers, suite/unit/apt/#NNN address collapse, plus two integration tests that drive find_or_create_venue twice with ampersand and suite-address variants and assert the same term_id comes back.

Refs #276
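The name-key collapse described in this commit can be illustrated with a small standalone sketch. It is not the body of `Venue_Taxonomy::normalize_venue_name_for_matching` (which this page does not show); the function name `sketch_normalize_venue_name` and the exact regexes are assumptions chosen to mirror the stated order: entity decode first, ampersand to ` and `, then the non-alphanumeric strip.

```php
<?php
// Hypothetical stand-in for the plugin's name normalizer; illustration only.
function sketch_normalize_venue_name(string $name): string
{
    // Decode entities first so '&amp;' becomes '&' and '&#039;' becomes "'".
    $key = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $key = strtolower($key);
    // Every ampersand, spaced or tight, becomes the word 'and'.
    $key = str_replace('&', ' and ', $key);
    // Non-alphanumeric strip: apostrophes and other punctuation disappear.
    $key = preg_replace('/[^a-z0-9\s]/', '', $key);
    // Collapse the extra whitespace left behind by the replacements.
    return trim(preg_replace('/\s+/', ' ', $key));
}

$variants = [
    'Hook & Ladder Theater',
    'Hook and Ladder Theater',
    'Hook &amp; Ladder Theater',
    'Hook&Ladder Theater',
];
$keys = array_unique(array_map('sketch_normalize_venue_name', $variants));
echo implode(', ', $keys), PHP_EOL; // hook and ladder theater
```

Running the function on its own output returns the same key, which is the idempotency property the tests above check for.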
…dation

Adds 'wp data-machine-events check merge-duplicate-venues', a one-time migration that consolidates duplicate venue terms produced before PR #252 (address-aware venue resolution) and the issue #276 normalization tightening shipped.

Behavior:
1. Walks every venue term. Computes both the normalized name and the normalized address+city keys via the now-public Venue_Taxonomy helpers. Groups by either key; emits any group with >=2 terms. Each term is emitted in at most one cluster (name pass before address pass) to avoid double-processing.
2. Per cluster, picks the LOWEST term_id as winner (oldest, most inbound link equity) and treats every other term as a loser.
3. Hands each (winner, loser) pair to VenueMergeHelper::merge which:
   a. Smart-merges loser meta into winner (fill empties only; never overwrites winner data).
   b. Reassigns every post tagged with the loser to the winner via wp_set_object_terms() so tt_count stays correct.
   c. Rewrites flow handler_config references in {prefix}datamachine_flows that point at the loser term_id (both 'handler_config.venue' and 'handler_config.universal_web_scraper.venue' shapes).
   d. Deletes the loser term last (order matters: reassign first, delete last, or inbound references orphan).
   e. Emits a structured 'venue_merge' info log via the datamachine_log action for the audit trail.

CLI surface:
  wp data-machine-events check merge-duplicate-venues [--dry-run] [--apply] [--limit=N] [--format=table|csv|json]

Defaults to dry-run (safer); --apply commits. --limit caps clusters processed per run (default 50).

Output table columns: cluster_key, winner_id, winner_name, loser_ids, loser_names, total_posts_reassigned, total_flows_reassigned, action_taken.

Respects a _venue_no_merge=1 term-meta opt-out on either side of a cluster (operator-driven protection for legit multi-room buildings or otherwise-similar venues that must stay separate).
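Steps 1 and 2 above (group by key, name pass before address pass, lowest term_id wins) can be sketched on plain arrays. This is an illustration under assumptions, not the command's code: `sketch_cluster_venues`, the `name_key`/`address_key` field names, the `address|city` key format and the term IDs in the sample data are all hypothetical.

```php
<?php
// Hypothetical clustering sketch. $terms is keyed by term_id; each entry
// carries the two precomputed keys. Returns cluster_key => winner/losers.
function sketch_cluster_venues(array $terms, int $limit = 50): array
{
    ksort($terms); // ascending term_id, so each group's first member is the lowest id
    $claimed  = [];
    $clusters = [];
    foreach (['name_key', 'address_key'] as $field) { // name pass runs first
        $groups = [];
        foreach ($terms as $term_id => $term) {
            if (isset($claimed[$term_id]) || $term[$field] === '') {
                continue; // already in a cluster, or nothing to match on
            }
            $groups[$term[$field]][] = $term_id;
        }
        foreach ($groups as $key => $ids) {
            if (count($ids) < 2) {
                continue; // a single term is not a duplicate cluster
            }
            foreach ($ids as $id) {
                $claimed[$id] = true; // each term lands in at most one cluster
            }
            $winner = array_shift($ids); // lowest term_id wins
            $clusters["$field:$key"] = ['winner' => $winner, 'losers' => $ids];
        }
    }
    return array_slice($clusters, 0, $limit, true); // cap clusters per run
}

$clusters = sketch_cluster_venues([
    20 => ['name_key' => 'hook and ladder theater', 'address_key' => '3010 minnehaha ave|minneapolis'],
    10 => ['name_key' => 'hook and ladder theater', 'address_key' => '3010 minnehaha ave|minneapolis'],
    50 => ['name_key' => 'venue x', 'address_key' => '1 example st|sampletown'],
    60 => ['name_key' => 'venue y', 'address_key' => '1 example st|sampletown'],
]);
print_r($clusters);
```

With this sample input the name pass produces one cluster (winner 10, loser 20) and the address pass produces a second (winner 50, loser 60); terms 10 and 20 are not reconsidered in the address pass because they were already claimed.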
Tests (VenueMergeHelperTest):
- dry-run lists clusters without writing
- apply reassigns posts and deletes loser; winner inherits empty meta fields from loser; pre-existing winner fields are preserved
- apply rewrites flow handler_config refs in both flat and nested shapes
- apply respects the _venue_no_merge opt-out (both terms intact)
- smart-merge does not overwrite an existing winner address even when loser carries a suite-suffix variant

Refs #276
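Two of the behaviors these tests cover, fill-empties-only meta merging and the flat/nested `handler_config` rewrite, are easy to show on plain arrays. The sketch below is a guess at the shape of the logic, not `VenueMergeHelper` itself: both function names, the meta keys, and the term IDs 101 and 202 are hypothetical, and no WordPress calls are made.

```php
<?php
// Hypothetical sketch: copy loser values only into fields the winner lacks.
function sketch_smart_merge_meta(array $winner_meta, array $loser_meta): array
{
    foreach ($loser_meta as $key => $value) {
        $winner_empty = !isset($winner_meta[$key]) || $winner_meta[$key] === '';
        if ($winner_empty && $value !== '' && $value !== null) {
            $winner_meta[$key] = $value; // fill the gap, never overwrite
        }
    }
    return $winner_meta;
}

// Hypothetical sketch: repoint 'venue' references from loser to winner.
function sketch_rewrite_venue_refs(array $config, int $loser_id, int $winner_id): array
{
    // Flat shape: handler_config.venue
    if (isset($config['venue']) && is_scalar($config['venue']) && (int) $config['venue'] === $loser_id) {
        $config['venue'] = $winner_id;
    }
    // Nested shape: handler_config.<handler>.venue
    foreach ($config as $handler => $settings) {
        if (is_array($settings) && isset($settings['venue']) && is_scalar($settings['venue'])
            && (int) $settings['venue'] === $loser_id) {
            $config[$handler]['venue'] = $winner_id;
        }
    }
    return $config; // keys such as 'venue_id' are never touched
}

$merged = sketch_smart_merge_meta(
    ['address' => '3010 Minnehaha Ave', 'phone' => ''],
    ['address' => '3010 Minnehaha Ave STE 420', 'phone' => '555-0100']
);
$flat   = sketch_rewrite_venue_refs(['venue' => '202'], 202, 101);
$nested = sketch_rewrite_venue_refs(['universal_web_scraper' => ['venue' => 202]], 202, 101);
$other  = sketch_rewrite_venue_refs(['ticketmaster' => ['venue_id' => 202]], 202, 101);
print_r([$merged, $flat, $nested, $other]);
```

In the real helper these two steps sit inside the stated order (merge meta, reassign posts, rewrite flow references, delete the loser last); the sketch leaves out the post reassignment and term deletion because those depend on WordPress.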
This was referenced May 18, 2026
Summary
Fixes #276.
Today's discovery batches keep producing duplicate venue terms because the two collision normalizers introduced by PR #252 (address-aware venue resolution) miss two well-defined cases:
- Name normalization does not collapse `&` ↔ `and`, does not decode `&amp;`, and never normalized apostrophes against terms that were created by pre-PR-252 codepaths.
- Address normalization does not strip `STE`, `Suite`, `Unit`, `Apt`, `#NNN`, etc., so `3010 Minnehaha Ave STE 420` and `3010 Minnehaha Ave` look like two different addresses.

A network audit today surfaced 6 production-confirmed collision clusters:
- `Amos Southend` ‖ 50476 `Amos' Southend`
- `Cliff Bells` ‖ 50459 `Cliff Bell's`
- `Guitars and Cadillacs` ‖ 48049 `Guitars & Cadillacs`
- `Harvard & Stone` ‖ 38934 `Harvard and Stone`
- `Hook and Ladder Theater` ‖ 50460 `Hook & Ladder Theater`
- `Proctor's Theatre` ‖ 27297 `Proctors Theatre`

Plus the suite-suffix variant for Hook & Ladder Theater:
- `3010 Minnehaha Ave STE 420` (Minneapolis)
- `3010 Minnehaha Ave` (Minneapolis)

Each pair is the same physical venue with two term IDs, two stat lines, two location-archive entries, and split event counts. PR #252 will not fix this on its own; the existing duplicates need a one-time migration.
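The suite-suffix collision above can be reproduced and resolved in isolation. The following is a minimal sketch, not the plugin's `normalize_address_for_matching`: the function name `sketch_normalize_address` and both regexes are assumptions, and the existing street-suffix replacements are left out. It uses two passes because `\b` cannot anchor in front of `#`.

```php
<?php
// Hypothetical stand-in for the plugin's address normalizer; illustration only.
function sketch_normalize_address(string $address): string
{
    $key = strtolower(trim($address));
    // Pass 1: word-boundary designators followed by an identifier containing a digit.
    $key = preg_replace(
        '/[\s,]*\b(?:suite|ste|unit|apartment|apt|room|rm)\b\.?\s*#?\s*[a-z0-9-]*\d[a-z0-9-]*\s*$/',
        '',
        $key
    );
    // Pass 2: bare '#NNN'. '#' is a non-word character, so '\b#' would not
    // match after a space; this pattern anchors on the '#' itself.
    $key = preg_replace('/[\s,]*#\s*[a-z0-9-]*\d[a-z0-9-]*\s*$/', '', $key);
    // Trim any comma or dash the strip left dangling.
    $key = rtrim($key, ' ,-');
    return preg_replace('/\s+/', ' ', $key);
}

$addresses = [
    '3010 Minnehaha Ave',
    '3010 Minnehaha Ave STE 420',
    '3010 Minnehaha Ave Suite 420',
    '3010 Minnehaha Ave Unit 420',
    '3010 Minnehaha Ave #420',
    '3010 minnehaha ave, ste 420',
];
$address_keys = array_unique(array_map('sketch_normalize_address', $addresses));
echo implode(', ', $address_keys), PHP_EOL; // 3010 minnehaha ave
```

Requiring a digit in the unit identifier is a design choice made for this sketch only; it keeps a street literally named "Unit St" from being stripped, at the cost of missing lettered units such as "Suite B".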
What changed
Part A: name normalization (`Venue_Taxonomy::normalize_venue_name_for_matching`)
- `html_entity_decode` was already first; now also converts every `&` variant (spaced, tight, decoded from `&amp;`) to literal ` and ` before the alphanumeric strip.

After this:
- `Hook & Ladder Theater` → `hook and ladder theater`
- `Hook and Ladder Theater` → `hook and ladder theater`
- `Hook &amp; Ladder Theater` → `hook and ladder theater`
- `Amos Southend`, `Amos' Southend` → `amos southend`
- `Cliff Bells`, `Cliff Bell's` → `cliff bells`

Part B: address normalization (`Venue_Taxonomy::normalize_address_for_matching`)
- Strips `ste`/`suite`/`unit`/`apt`/`apartment`/`room`/`rm`/`#NNN` suffixes before the existing street-suffix replacements. Two passes: word-boundary suffixes plus a dedicated `#NNN` pattern (since `#` is non-word and `\b` does not anchor against it).
- Method is now `public` so the migration helper and unit tests can call it directly.

After this all of these collapse to `3010 minnehaha ave`: `3010 Minnehaha Ave`, `3010 Minnehaha Ave STE 420`, `3010 Minnehaha Ave Suite 420`, `3010 Minnehaha Ave Unit 420`, `3010 Minnehaha Ave #420`, `3010 minnehaha ave, ste 420`.

Part C: one-time migration command
`wp data-machine-events check merge-duplicate-venues [--dry-run] [--apply] [--limit=N] [--format=table|csv|json]`

Defaults to dry-run. `--apply` commits. Walks every venue term, groups by normalized name AND normalized address+city, and for each cluster picks the lowest term_id as winner. Hands each `(winner, loser)` pair to the new `VenueMergeHelper::merge()` which performs the merge in this strict order:
1. Smart-merges loser meta into the winner (fills empties only; never overwrites winner data).
2. Reassigns every post tagged with the loser to the winner via `wp_set_object_terms()` so `tt_count` stays in sync (no raw `term_relationships` SQL).
3. Rewrites flow handler_config references in `{prefix}datamachine_flows` that point at the loser term_id; handles both flat `handler_config.venue` and nested `handler_config.universal_web_scraper.venue` shapes observed in production. Ticketmaster's `venue_id` is intentionally left alone (external venue identifier, not a WP term).
4. Deletes the loser term last.
5. Emits a structured `venue_merge` info entry via the `datamachine_log` action for the audit trail.

Respects a `_venue_no_merge=1` term-meta opt-out on either side of a cluster (operator protection for legit multi-room buildings or otherwise-similar venues that must stay separate).

Output table columns: `cluster_key, winner_id, winner_name, loser_ids, loser_names, total_posts_reassigned, total_flows_reassigned, action_taken`.

Tests
`tests/Unit/VenueNormalizationTest.php`
- `test_normalize_venue_name_collapses_ampersand_and_and`
- `test_normalize_venue_name_collapses_apostrophes`
- `test_normalize_venue_name_idempotent`
- `test_normalize_address_strips_suite_suffix`
- `test_normalize_address_idempotent`
- `test_find_or_create_venue_returns_same_id_for_ampersand_variants` (integration)
- `test_find_or_create_venue_returns_same_id_for_suite_address_variants` (integration)

`tests/Unit/VenueMergeHelperTest.php`
- `test_merge_command_dry_run_lists_clusters_without_writes`
- `test_merge_command_apply_reassigns_posts_and_deletes_loser`
- `test_merge_command_apply_reassigns_flow_handler_configs`
- `test_merge_command_respects_no_merge_opt_out`
- `test_merge_command_smart_merge_does_not_overwrite_winner`

Live smoke-test of the normalizers via `php -r` on the production WP load confirms every variant in the issue body collapses correctly. `homeboy audit .` reports zero new outliers attributable to the new files.

Migration playbook
After merge + deploy, run the command in dry-run mode (the default) on each subsite that owns venue terms (events.extrachill.com is the primary).
Review the cluster table, sanity-check the winners, then commit by re-running with `--apply`.
Use `--limit=N` to throttle batches if a single run would be too large.

Constraints honored
- Commit prefixes: `fix(venue-dedup):` and `feat(check):`.
- No `CHANGELOG.md` edits. No release. No deploy.
- Workspace path: `/var/lib/datamachine/workspace/data-machine-events@fix-venue-dedup-normalization`.
- Target branch: `main`. Not merged.