
fix(venue-dedup): normalize ampersand/HTML-entity/apostrophe in names + strip suite suffixes from addresses + migration CLI #278

Merged
chubes4 merged 2 commits into main from fix-venue-dedup-normalization
May 18, 2026

Conversation

chubes4 (Member) commented May 18, 2026

Summary

Fixes #276.

Today's discovery batches keep producing duplicate venue terms because the two collision normalizers introduced by PR #252 (address-aware venue resolution) miss two well-defined cases:

  1. Name normalization does not collapse "&" and "and" to one form, does not decode "&amp;", and never normalized apostrophes against terms that were created by pre-PR-252 codepaths.
  2. Address normalization does not strip STE, Suite, Unit, Apt, #NNN, etc. — so 3010 Minnehaha Ave STE 420 and 3010 Minnehaha Ave look like two different addresses.

A network audit today surfaced 6 production-confirmed collision clusters:

| norm_name | count | terms |
|---|---|---|
| amos southend | 2 | 17731 Amos Southend ‖ 50476 Amos' Southend |
| cliff bells | 2 | 27397 Cliff Bells ‖ 50459 Cliff Bell's |
| guitars and cadillacs | 2 | 32935 Guitars and Cadillacs ‖ 48049 Guitars & Cadillacs |
| harvard and stone | 2 | 9523 Harvard & Stone ‖ 38934 Harvard and Stone |
| hook and ladder theater | 2 | 27373 Hook and Ladder Theater ‖ 50460 Hook & Ladder Theater |
| proctors theatre | 2 | 27296 Proctor's Theatre ‖ 27297 Proctors Theatre |

Plus the suite-suffix variant for Hook & Ladder Theater:

  • 27373 — 3010 Minnehaha Ave STE 420 (Minneapolis)
  • 50460 — 3010 Minnehaha Ave (Minneapolis)

Each pair is the same physical venue with two term IDs, two stat lines, two location-archive entries, and split event counts. PR #252 will not fix this on its own — the existing duplicates need a one-time migration.

What changed

Part A — name normalization (Venue_Taxonomy::normalize_venue_name_for_matching)

  • html_entity_decode was already first; now also converts every "&" variant (spaced, tight, decoded from "&amp;") to a literal " and " before the alphanumeric strip.
  • Apostrophes were already stripped via the non-alphanumeric pass; documented alongside the ampersand handling so the full collision surface is covered in one place.

After this:

  • Hook & Ladder Theater → hook and ladder theater
  • Hook and Ladder Theater → hook and ladder theater
  • Hook &amp; Ladder Theater → hook and ladder theater
  • Amos Southend, Amos' Southend → amos southend
  • Cliff Bells, Cliff Bell's → cliff bells
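The collapse described in Part A can be sketched in plain PHP. This is an illustrative approximation, not the body of Venue_Taxonomy::normalize_venue_name_for_matching (which this description does not show): the function name sketch_normalize_venue_name and the exact order of the lowercase and whitespace steps are assumptions.

```php
<?php
// Hypothetical stand-in for the real normalizer; names and step order are assumed.
function sketch_normalize_venue_name(string $name): string
{
    // Decode entities first so "&amp;" (or "&#038;") becomes a literal "&".
    $name = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $name = strtolower($name);

    // Every ampersand, spaced or tight, becomes " and ".
    $name = str_replace('&', ' and ', $name);

    // Non-alphanumeric strip: this is where apostrophes disappear.
    $name = preg_replace('/[^a-z0-9\s]/', '', $name);

    // Collapse the extra whitespace introduced by the replacements above.
    return trim(preg_replace('/\s+/', ' ', $name));
}

foreach (['Hook & Ladder Theater', 'Hook &amp; Ladder Theater', "Cliff Bell's"] as $raw) {
    echo $raw, ' => ', sketch_normalize_venue_name($raw), PHP_EOL;
}
```

Replacing the ampersand before the non-alphanumeric strip matters: if the strip ran first, "&" would simply vanish and "Hook & Ladder" would become "hook ladder" instead of matching "hook and ladder".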

Part B — address normalization (Venue_Taxonomy::normalize_address_for_matching)

  • Strips ste/suite/unit/apt/apartment/room/rm/#NNN suffixes before the existing street-suffix replacements. Two passes: word-boundary suffixes plus a dedicated #NNN pattern (since # is non-word and \b does not anchor against it).
  • Trailing comma/dash is trimmed so substring comparison stays clean.
  • Method is now public so the migration helper and unit tests can call it directly.

After this, all of these collapse to 3010 minnehaha ave:

  • 3010 Minnehaha Ave
  • 3010 Minnehaha Ave STE 420
  • 3010 Minnehaha Ave Suite 420
  • 3010 Minnehaha Ave Unit 420
  • 3010 Minnehaha Ave #420
  • 3010 minnehaha ave, ste 420
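The two-pass strip from Part B might look roughly like the following. It is a sketch under assumptions: the function name sketch_normalize_address and both regular expressions are illustrative, the unit identifier is assumed to start with a digit, and the existing street-suffix replacements that run afterwards in the real method are omitted.

```php
<?php
// Hypothetical stand-in for the real address normalizer; patterns are assumed.
function sketch_normalize_address(string $address): string
{
    $address = strtolower(trim($address));

    // Pass 1: word-boundary designators followed by a unit number ("ste 420", "apt #4b").
    $address = preg_replace(
        '/\b(?:ste|suite|unit|apt|apartment|room|rm)\b\.?\s*#?\s*\d[a-z0-9-]*/',
        '',
        $address
    );

    // Pass 2: bare "#420". "#" is a non-word character, so \b cannot anchor before it.
    $address = preg_replace('/#\s*[a-z0-9-]+/', '', $address);

    // Tidy whitespace, then trim the comma or dash the strip may leave behind.
    $address = preg_replace('/\s+/', ' ', $address);
    return rtrim($address, ' ,-');
}

echo sketch_normalize_address('3010 Minnehaha Ave STE 420'), PHP_EOL;
echo sketch_normalize_address('3010 minnehaha ave, ste 420'), PHP_EOL;
```

Requiring a digit after the designator is a deliberate choice in this sketch so that a street literally named "Unit St" or "Room Rd" is left alone; whether the shipped pattern makes the same trade-off is not stated here.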

Part C — one-time migration command

wp data-machine-events check merge-duplicate-venues [--dry-run] [--apply] [--limit=N] [--format=table|csv|json]

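Defaults to dry-run; --apply commits. --limit caps the clusters processed per run (default 50). Walks every venue term, groups first by normalized name and then by normalized address+city (a term lands in at most one cluster, with the name pass running first), and for each cluster of two or more terms picks the lowest term_id as winner. Hands each (winner, loser) pair to the new VenueMergeHelper::merge(), which performs the merge in this strict order: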

  1. Smart-merge loser meta into winner (fill empties only — never overwrites winner data).
  2. Reassign posts via wp_set_object_terms() so tt_count stays in sync (no raw term_relationships SQL).
  3. Rewrite flow handler_config references in {prefix}datamachine_flows that point at the loser term_id — handles both flat handler_config.venue and nested handler_config.universal_web_scraper.venue shapes observed in production. Ticketmaster's venue_id is intentionally left alone (external venue identifier, not a WP term).
  4. Delete the loser term last (order matters — reassign first so no inbound reference orphans).
  5. Emits a structured venue_merge info entry via the datamachine_log action for the audit trail.
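Steps 1 and 3 above are the two places where data is rewritten rather than moved, so they are worth illustrating on plain arrays. The sketch below is not VenueMergeHelper itself: both function names, the array layouts, the meta keys, and the choice to write the winner ID back as an integer are assumptions made for illustration.

```php
<?php
// Step 1 (assumed shape): copy a loser value only where the winner has nothing.
function sketch_smart_merge_meta(array $winner, array $loser): array
{
    foreach ($loser as $key => $value) {
        $winner_is_empty = !isset($winner[$key]) || $winner[$key] === '';
        if ($winner_is_empty && $value !== '' && $value !== null) {
            $winner[$key] = $value;
        }
    }
    return $winner;
}

// Step 3 (assumed shape): repoint venue references inside one decoded handler_config.
function sketch_rewrite_venue_ref(array $config, int $loser_id, int $winner_id): array
{
    // Flat shape: handler_config.venue
    if (isset($config['venue']) && (int) $config['venue'] === $loser_id) {
        $config['venue'] = $winner_id;
    }

    // Nested shape: handler_config.universal_web_scraper.venue
    if (isset($config['universal_web_scraper']['venue'])
        && (int) $config['universal_web_scraper']['venue'] === $loser_id) {
        $config['universal_web_scraper']['venue'] = $winner_id;
    }

    // Any "venue_id" key is an external identifier and is deliberately not touched.
    return $config;
}

$winner = sketch_smart_merge_meta(
    ['address' => '3010 Minnehaha Ave', 'phone' => ''],
    ['address' => '3010 Minnehaha Ave STE 420', 'phone' => '555-0100']
);
print_r($winner); // address stays as the winner had it; phone is filled from the loser
```

Keeping both helpers free of side effects makes the dry-run path straightforward: the same functions can compute what would change without writing anything.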

Respects a _venue_no_merge=1 term-meta opt-out on either side of a cluster (operator protection for legit multi-room buildings or otherwise-similar venues that must stay separate).
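The cluster-building pass can be sketched as follows. This is a simplified model, not the command's actual code: the function name, the input row shape (id, name_key, address_key, no_merge), and the decision to skip a whole cluster when any member carries the opt-out are all assumptions. It does follow the stated rules that the name pass runs before the address pass, that each term appears in at most one cluster, and that the lowest term_id wins.

```php
<?php
// Hypothetical clustering sketch; each row is ['id', 'name_key', 'address_key', 'no_merge'?].
function sketch_build_clusters(array $terms): array
{
    $clusters = [];
    $claimed  = []; // term IDs already emitted in an earlier cluster

    foreach (['name_key', 'address_key'] as $pass) {
        $groups = [];
        foreach ($terms as $term) {
            if (isset($claimed[$term['id']]) || $term[$pass] === '') {
                continue;
            }
            $groups[$term[$pass]][] = $term;
        }

        foreach ($groups as $key => $members) {
            if (count($members) < 2) {
                continue;
            }
            foreach ($members as $member) {
                if (!empty($member['no_merge'])) {
                    continue 2; // assumed: an opt-out on any side protects the whole cluster
                }
            }

            $ids = array_column($members, 'id');
            sort($ids);
            $winner_id = array_shift($ids); // lowest term_id wins

            $clusters[] = [
                'cluster_key' => (string) $key,
                'winner_id'   => $winner_id,
                'loser_ids'   => $ids,
            ];
            foreach ($members as $member) {
                $claimed[$member['id']] = true;
            }
        }
    }

    return $clusters;
}
```

Given the Hook & Ladder pair from the audit (27373 and 50460, sharing one normalized name), this yields a single cluster with 27373 as winner and 50460 as the only loser.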

Output table columns: cluster_key, winner_id, winner_name, loser_ids, loser_names, total_posts_reassigned, total_flows_reassigned, action_taken.

Tests

tests/Unit/VenueNormalizationTest.php

  • test_normalize_venue_name_collapses_ampersand_and_and
  • test_normalize_venue_name_collapses_apostrophes
  • test_normalize_venue_name_idempotent
  • test_normalize_address_strips_suite_suffix
  • test_normalize_address_idempotent
  • test_find_or_create_venue_returns_same_id_for_ampersand_variants (integration)
  • test_find_or_create_venue_returns_same_id_for_suite_address_variants (integration)

tests/Unit/VenueMergeHelperTest.php

  • test_merge_command_dry_run_lists_clusters_without_writes
  • test_merge_command_apply_reassigns_posts_and_deletes_loser
  • test_merge_command_apply_reassigns_flow_handler_configs
  • test_merge_command_respects_no_merge_opt_out
  • test_merge_command_smart_merge_does_not_overwrite_winner

Live smoke-test of the normalizers via php -r on the production WP load confirms every variant in the issue body collapses correctly. homeboy audit . reports zero new outliers attributable to the new files.

Test runner caveat: this repo's phpunit.xml points at tests/bootstrap.php which is not checked in (no vendor/ directory either). PHPUnit tests do not execute locally in the worktree — they are designed to run in CI against a WP-PHPUnit harness. The test files follow the existing tests/Unit/*Test.php conventions and will be picked up by the same suite that exercises EventMergeHelperTest etc.

Migration playbook

After merge + deploy, run on each subsite that owns venue terms (events.extrachill.com is the primary):

wp --url=events.extrachill.com data-machine-events check merge-duplicate-venues --dry-run

Review the cluster table, sanity-check the winners, then commit:

wp --url=events.extrachill.com data-machine-events check merge-duplicate-venues --apply

Use --limit=N to throttle batches if a single run would be too large.

Constraints honored

  • Conventional commits (fix(venue-dedup): and feat(check):).
  • No version bump. No CHANGELOG.md edits. No release. No deploy.
  • Worktree-only changes (/var/lib/datamachine/workspace/data-machine-events@fix-venue-dedup-normalization).
  • PR against main. Not merged.

homeboy-ci Bot added 2 commits May 18, 2026 03:17
… + strip suite suffixes from addresses

Tightens the two venue-collision normalizers introduced by PR #252
(address-aware venue resolution) so today's 6 production-confirmed
duplicate venue-term clusters stop reproducing on every discovery batch.

Name normalization (Venue_Taxonomy::normalize_venue_name_for_matching):
  - html_entity_decode is already first; now also converts every '&'
    variant (spaced, tight, decoded from '&amp;') to literal ' and '
    before the alphanumeric strip. 'Hook & Ladder Theater',
    'Hook and Ladder Theater', and 'Hook &amp; Ladder Theater' now
    collapse to the same key.
  - Apostrophes were already stripped via the non-alphanumeric pass;
    documented alongside the ampersand handling so future readers see
    both collisions covered here.

Address normalization (Venue_Taxonomy::normalize_address_for_matching):
  - Strips suite/unit/apt/apartment/room/rm/#NNN suffixes before the
    existing street-suffix replacements so '3010 Minnehaha Ave STE 420'
    collapses to '3010 minnehaha ave'. Two passes: word-boundary
    suffixes plus a dedicated '#NNN' pattern (since '#' is non-word
    and word-boundary anchors do not catch it).
  - Trailing comma/dash from the strip is trimmed so substring
    comparison stays clean.
  - Method is now public so the upcoming migration helper and unit
    tests can call it directly.

Adds VenueMergeHelper (inc/Core/DuplicateDetection) — the term-level
analogue of EventMergeHelper. Single source of truth for 'given a
winner term and loser term, smart-merge meta + reassign posts +
rewrite flow handler_config refs + delete loser', in that exact order.
Reassignment is via wp_set_object_terms so tt_count cache stays
correct (no raw term_relationships SQL). Flow handler_config rewriting
covers both flat ('handler_config.venue') and nested
('handler_config.universal_web_scraper.venue') shapes observed in
production. Respects a _venue_no_merge=1 opt-out flag on either side.

Tests (VenueNormalizationTest): ampersand/HTML-entity collapse,
apostrophe collapse (Amos Southend, Cliff Bell's, Proctor's Theatre),
idempotency for both normalizers, suite/unit/apt/#NNN address collapse,
plus two integration tests that drive find_or_create_venue twice with
ampersand and suite-address variants and assert the same term_id
comes back.

Refs #276
…dation

Adds 'wp data-machine-events check merge-duplicate-venues' — a one-time
migration that consolidates duplicate venue terms produced before PR
#252 (address-aware venue resolution) and the issue #276 normalization
tightening shipped.

Behavior:
  1. Walks every venue term. Computes both the normalized name and
     the normalized address+city keys via the now-public Venue_Taxonomy
     helpers. Groups by either key; emits any group with >=2 terms.
     Each term is emitted in at most one cluster (name pass before
     address pass) to avoid double-processing.
  2. Per cluster, picks the LOWEST term_id as winner (oldest, most
     inbound link equity) and treats every other term as a loser.
  3. Hands each (winner, loser) pair to VenueMergeHelper::merge which:
     a. Smart-merges loser meta into winner (fill empties only — never
        overwrites winner data).
     b. Reassigns every post tagged with the loser to the winner via
        wp_set_object_terms() so tt_count stays correct.
     c. Rewrites flow handler_config references in
        {prefix}datamachine_flows that point at the loser term_id
        (both 'handler_config.venue' and
        'handler_config.universal_web_scraper.venue' shapes).
     d. Deletes the loser term last (order matters: reassign first,
        delete last, or inbound references orphan).
     e. Emits a structured 'venue_merge' info log via the
        datamachine_log action for the audit trail.

CLI surface:
  wp data-machine-events check merge-duplicate-venues
      [--dry-run] [--apply] [--limit=N] [--format=table|csv|json]

Defaults to dry-run (safer); --apply commits. --limit caps clusters
processed per run (default 50). Output table columns:
cluster_key, winner_id, winner_name, loser_ids, loser_names,
total_posts_reassigned, total_flows_reassigned, action_taken.

Respects a _venue_no_merge=1 term-meta opt-out on either side of a
cluster (operator-driven protection for legit multi-room buildings or
otherwise-similar venues that must stay separate).

Tests (VenueMergeHelperTest):
  - dry-run lists clusters without writing
  - apply reassigns posts and deletes loser; winner inherits empty
    meta fields from loser; pre-existing winner fields are preserved
  - apply rewrites flow handler_config refs in both flat and nested
    shapes
  - apply respects the _venue_no_merge opt-out (both terms intact)
  - smart-merge does not overwrite an existing winner address even
    when loser carries a suite-suffix variant

Refs #276
homeboy-ci (Bot, Contributor) commented May 18, 2026

Homeboy Results — data-machine-events

Audit

audit — passed

  • requested_detectors — 2 finding(s)
  • DuplicateDetection — 1 finding(s)
  • dead_code — 1 finding(s)
  • test_coverage — 1 finding(s)
  • Total: 5 finding(s)

Deep dive: homeboy audit data-machine-events --changed-since e0d4938

Tooling versions
  • Homeboy CLI: homeboy 0.182.0+1c74a36
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: dd47f26a
  • Action: Extra-Chill/homeboy-action@v2

Development

Successfully merging this pull request may close these issues.

fix(venue-dedup): normalize ampersand/HTML-entity/apostrophe in venue names and strip suite suffixes from addresses