
feat(check): audit + repair venue terms with missing addresses or orphan status#283

Merged: chubes4 merged 3 commits into main from feat-venue-audit-repair on May 19, 2026


@chubes4 chubes4 commented May 18, 2026

Fixes #277.

Production evidence (events.extrachill.com, blog 7)

Today's network-wide venue audit found:

| Metric | Count | % of 3,765 |
| --- | --- | --- |
| Venues with no `_venue_address` meta | 262 | 7% |
| Orphan venues (`wp_term_taxonomy.count = 0`) | 278 | 7% |

Both classes are invisible to existing dedup / qualify paths:

  • No-address venues never collide via find_venue_by_address, so every batch discovery run creates a fresh duplicate term.
  • Orphans are noise in the verdict-log + qualify pipeline — qualify can resolve an incoming venue string to a term with zero event history, then attach a flow to it.

What this PR ships (3 commits)

Part A — `wp data-machine-events check missing-venue-addresses`

Commit `feat(check): add missing-venue-addresses audit + repair command`.

Walks every venue term whose `_venue_address` is empty/missing and tries, in order:

  1. Reverse geocode from `_venue_coordinates` via Nominatim (3,564 of 3,765 venues, 94.7%, already have coords). Parses the response into `_venue_address`, `_venue_city`, `_venue_state`, `_venue_zip`, `_venue_country` and smart-merges into empty fields only.
  2. Places lookup — forward-search Nominatim for `{name} {city}` when coords are missing but the city is set. The top hit is gated by `VenueMergeHelper::names_are_similar()` (Jaccard ≥ 0.70 / substring / exact normalized) so we never silently rewrite "The Local Bar" → "Texas Music Theater."
  3. Residue — venues with neither coords nor city are surfaced as `action=no_repair_possible` for operator review; no writes.
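The name gate in step 2 can be sketched as follows. This is Python rather than the plugin's PHP, and it is not `VenueMergeHelper::names_are_similar()` itself: the normalization and tokenization rules are assumptions, and only the three checks (exact normalized match, substring, Jaccard ≥ 0.70) come from the description above.

```python
import re

JACCARD_THRESHOLD = 0.70  # threshold quoted in the PR description


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rules)."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    return " ".join(cleaned.split())


def jaccard(a: str, b: str) -> float:
    """Jaccard index over the word sets of two already-normalized names."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def names_are_similar(left: str, right: str) -> bool:
    """Accept on exact normalized match, substring containment, or Jaccard >= threshold."""
    a, b = normalize(left), normalize(right)
    if not a or not b:
        return False
    if a == b:
        return True
    if a in b or b in a:
        return True
    return jaccard(a, b) >= JACCARD_THRESHOLD
```

With these assumed rules, "The Local Bar" and "Texas Music Theater" share no words, so the top hit would be rejected and no address written.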

Default `--dry-run`; `--apply` required to commit. `--limit` (default 50), `--format=table|csv|json`. The network paths are isolated in protected methods so unit tests stub them without hitting Nominatim.
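For reviewers who want the three-way branch in one place, here is a rough Python sketch of the planning step. It is an illustration under stated assumptions, not the command's code: the function name, the `*_no_match` action labels, the shape of the lookup results, and the behavior when a lookup returns nothing are all hypothetical. Only `no_repair_possible`, the meta keys, and the ordering come from the description above. The two lookups and the name check are passed in as callables, mirroring how the real network paths are isolated for stubbing.

```python
ADDRESS_KEYS = (
    "_venue_address",
    "_venue_city",
    "_venue_state",
    "_venue_zip",
    "_venue_country",
)


def plan_repair(name, meta, reverse_geocode, places_lookup, names_match):
    """Return the proposed action and fields for one venue without writing anything."""
    coords = meta.get("_venue_coordinates")
    city = meta.get("_venue_city")

    if coords:
        source = "reverse_geocode"
        parsed = reverse_geocode(coords)
    elif city:
        source = "places_lookup"
        hit = places_lookup(f"{name} {city}")
        # Reject the top hit unless its name resembles the venue's own name.
        parsed = hit["fields"] if hit and names_match(name, hit["name"]) else None
    else:
        # Neither coordinates nor city: surfaced for operator review.
        return {"action": "no_repair_possible", "fields": {}}

    if not parsed:
        # Hypothetical label; the PR does not say how a failed lookup is reported.
        return {"action": source + "_no_match", "fields": {}}

    # Smart-merge: propose values only for fields that are currently empty.
    fields = {
        key: parsed[key]
        for key in ADDRESS_KEYS
        if parsed.get(key) and not meta.get(key)
    }
    return {"action": source, "fields": fields}
```

In a dry run the returned plan would only be printed; with `--apply` the proposed fields would be written to term meta.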

Part B — `wp data-machine-events check orphan-venues`

Commit `feat(check): add orphan-venues audit + flag-not-delete command`.

For every venue term where `wp_term_taxonomy.count = 0`:

  1. Verify the count cache against the real `wp_term_relationships` join. Stale cache → `wp_update_term_count_now()` and skip (not actually orphaned).
  2. Protect terms referenced by an active flow — same LIKE patterns `VenueMergeHelper::reassign_flow_handler_configs()` uses. Stamp `_venue_orphan_protected_by_flow = <flow_id>` and skip deletion.
  3. Flag real orphans — default action is FLAG-NOT-DELETE. Stamps `_venue_orphan_flagged_at` with current Unix time; leaves the term in place. Operator decides whether to delete.
  4. Delete only with `--delete-orphans` opt-in — even then, terms with `_venue_no_merge` or a pre-existing `_venue_orphan_protected_by_flow` are kept.

Default `--dry-run`; `--apply` required; `--delete-orphans` further required for deletion. `--limit` (default 100), `--format=table|csv|json`.
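The four steps above amount to an ordered decision tree, sketched here in Python. The `OrphanCandidate` type, its field names, and the action labels are hypothetical; the ordering and the two deletion guards follow the description. How a guarded term is reported under `--delete-orphans` is not stated in this PR, so the `keep` label is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrphanCandidate:
    term_id: int
    real_relationship_count: int                # rows found by the wp_term_relationships join
    referencing_flow_id: Optional[int] = None   # active flow whose config names this term
    no_merge: bool = False                      # `_venue_no_merge` meta is set
    already_protected: bool = False             # `_venue_orphan_protected_by_flow` already stamped


def decide(term: OrphanCandidate, delete_orphans: bool = False) -> str:
    """Pick the action for one venue term whose cached count is 0."""
    if term.real_relationship_count > 0:
        # Step 1: the cache was stale, so refresh it and skip the term.
        return "refresh_count"
    if term.referencing_flow_id is not None:
        # Step 2: in use by a flow, so stamp the protection meta and skip.
        return "protect"
    if not delete_orphans:
        # Step 3: default is flag-not-delete.
        return "flag"
    if term.no_merge or term.already_protected:
        # Step 4 guard: these terms survive even with --delete-orphans.
        return "keep"
    return "delete"
```

Without `--apply` the chosen action would only be reported, so a dry run with `--delete-orphans` previews deletions without performing them.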

Part C — `data-machine-events/venue-stats` ability

Commit `feat(digest): add data-machine-events/venue-stats ability`.

Read-only ability returning `{ no_address: int, orphans: int, total: int, queried_at: int }`. Two aggregate SQL queries + one LEFT JOIN so the weekly qualify-digest can call it across multiple sites cheaply.
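To make the query shape concrete, here is a small Python sketch that runs comparable aggregates against an in-memory SQLite database. It is a stand-in, not the ability's code: the real tables carry the WordPress prefix and live in MySQL, the `venue` taxonomy slug is an assumption, and the exact split into queries is a guess at what "two aggregates plus one LEFT JOIN" means.

```python
import sqlite3
import time

# Venues with no `_venue_address` row, or with an empty value, via a LEFT JOIN.
NO_ADDRESS_SQL = """
SELECT COUNT(DISTINCT tt.term_id)
FROM term_taxonomy AS tt
LEFT JOIN termmeta AS tm
       ON tm.term_id = tt.term_id AND tm.meta_key = '_venue_address'
WHERE tt.taxonomy = ?
  AND (tm.meta_value IS NULL OR tm.meta_value = '')
"""
ORPHANS_SQL = 'SELECT COUNT(*) FROM term_taxonomy WHERE taxonomy = ? AND "count" = 0'
TOTAL_SQL = "SELECT COUNT(*) FROM term_taxonomy WHERE taxonomy = ?"


def venue_stats(conn, taxonomy="venue", now=None):
    """Return the four-field snapshot without loading any term rows into memory."""
    def scalar(sql):
        return conn.execute(sql, (taxonomy,)).fetchone()[0]

    return {
        "no_address": scalar(NO_ADDRESS_SQL),
        "orphans": scalar(ORPHANS_SQL),
        "total": scalar(TOTAL_SQL),
        "queried_at": int(time.time() if now is None else now),
    }


def demo_connection():
    """Seed five made-up venue terms: two without an address, two with count 0."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term_taxonomy (term_id INTEGER, taxonomy TEXT, "count" INTEGER);
        CREATE TABLE termmeta (term_id INTEGER, meta_key TEXT, meta_value TEXT);
    """)
    conn.executemany(
        "INSERT INTO term_taxonomy VALUES (?, 'venue', ?)",
        [(1, 3), (2, 2), (3, 0), (4, 0), (5, 1)],
    )
    conn.executemany(
        "INSERT INTO termmeta VALUES (?, '_venue_address', ?)",
        [(1, "1 Example St"), (2, ""), (4, "2 Example Ave"), (5, "3 Example Rd")],
    )
    return conn
```

Calling `venue_stats(demo_connection())` on this fixture counts terms 2 and 3 as `no_address` and terms 3 and 4 as `orphans`; the two classes overlap on term 3.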

Cross-plugin wiring is a follow-up. Per the issue test plan I considered shipping the digest-line additions in extrachill-events here too, but landing both in one PR would violate the one-repo-per-PR rule and force a cross-plugin review. Instead, this PR exposes the ability, and the digest wiring (reading the ability, formatting the two trend lines, persisting last week's counts for the delta) lands in a follow-up issue against `extrachill-events`. Operators can run the CLI commands directly in the meantime.

Architectural notes / choices

  • Geocoding backend is Nominatim (OpenStreetMap), not Google. The issue mentions Google Places as the fallback strategy but this codebase has no Google Places / Geocoding client — only Nominatim via `Venue_Taxonomy::query_nominatim()` and `GeocodingAbilities`. Adding a Google client would introduce a paid API-key dependency the network does not provision. Reusing Nominatim's forward-search keeps the layer agnostic. Filed as a candidate follow-up: if Google coverage proves materially better than Nominatim on the residue set, add it as a second provider behind the same protected method.
  • Smart-merge is non-negotiable — mirrors `VenueMergeHelper::fill_empty_meta`'s fill-empties-only contract. A pre-existing `_venue_city` survives even if reverse-geocode returns a different city; operators may have curated something the automated path should never overwrite.
  • Rate-limiting — both Nominatim paths sleep `2s` between calls, matching `GeocodingAbilities::RATE_LIMIT_SECONDS` so a large `--apply` run doesn't get banned.
  • Conservative defaults — `--dry-run` is the default for both commands; `--delete-orphans` is opt-in even with `--apply` (the issue is explicit on this: "venues are user-visible content surface; deletion needs operator intent").
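The fill-empties-only contract and the pause between lookups are small enough to sketch. The Python below is illustrative only: the function names are hypothetical (it is not `VenueMergeHelper::fill_empty_meta`), and the 2-second value is taken from the note above rather than read from `GeocodingAbilities`.

```python
import time

RATE_LIMIT_SECONDS = 2  # value quoted in the notes above


def fill_empty_meta(existing: dict, incoming: dict) -> dict:
    """Copy incoming values into empty or missing fields only; never overwrite."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value and not merged.get(key):
            merged[key] = value
    return merged


def throttled(items, seconds=RATE_LIMIT_SECONDS, sleep=time.sleep):
    """Yield items one by one, sleeping between consecutive items (not before the first)."""
    for index, item in enumerate(items):
        if index:
            sleep(seconds)
        yield item
```

Passing `sleep` as a parameter is a design choice for the sketch: a test can substitute a recorder and assert on the pauses without waiting.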

Tests

13 new tests across three files. All match the existing `VenueMergeHelperTest` pattern (`WP_UnitTestCase` + `wp_insert_term` against the real WP test DB).

`tests/Unit/CheckMissingVenueAddressesCommandTest.php`:

  • `test_dry_run_reports_count_without_writes`
  • `test_apply_fills_from_coordinates`
  • `test_apply_falls_back_to_places_search_when_no_coords`
  • `test_smart_merge_does_not_overwrite_existing_fields`
  • `test_residue_reports_no_repair_possible`
  • `test_places_lookup_rejects_low_name_similarity`

`tests/Unit/CheckOrphanVenuesCommandTest.php`:

  • `test_dry_run_lists_orphans`
  • `test_refreshes_stale_count_cache_before_deciding`
  • `test_protects_orphan_referenced_by_active_flow`
  • `test_flags_real_orphan_without_delete_orphans`
  • `test_deletes_real_orphan_with_delete_orphans`
  • `test_does_not_delete_protected_orphans_even_with_delete_orphans`

`tests/Unit/VenueStatsAbilitiesTest.php`:

  • `test_venue_stats_ability_returns_expected_shape`

The network surfaces (Nominatim reverse-geocode and search) are stubbed in tests via a small subclass that overrides the protected `reverse_geocode()` / `places_lookup()` methods — no network calls during test runs.
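The stubbing seam described above looks roughly like this, shown in Python for brevity. The class names and the `resolve()` method are hypothetical; the real tests subclass the PHP command and override its protected `reverse_geocode()` / `places_lookup()` methods.

```python
class AddressLookup:
    """Stand-in for the command class; only the overridable seam matters here."""

    def reverse_geocode(self, coords):
        raise RuntimeError("would call Nominatim reverse geocoding")

    def places_lookup(self, query):
        raise RuntimeError("would call Nominatim search")

    def resolve(self, name, meta):
        # Coordinates win; otherwise fall back to a name + city search.
        if meta.get("_venue_coordinates"):
            return self.reverse_geocode(meta["_venue_coordinates"])
        if meta.get("_venue_city"):
            return self.places_lookup(f"{name} {meta['_venue_city']}")
        return None


class StubbedAddressLookup(AddressLookup):
    """Test double: returns canned results and records which path was taken."""

    def __init__(self, reverse_result=None, places_result=None):
        self.reverse_result = reverse_result
        self.places_result = places_result
        self.calls = []

    def reverse_geocode(self, coords):
        self.calls.append(("reverse_geocode", coords))
        return self.reverse_result

    def places_lookup(self, query):
        self.calls.append(("places_lookup", query))
        return self.places_result
```

Recording the calls lets a test assert both the result and that the expected path ran, for example that a venue without coordinates never reaches the reverse-geocode method.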

Local test runner caveat

I attempted to run the full suite locally via `homeboy test --extension wordpress` and hit a pre-existing bootstrap failure unrelated to this PR: `BOOTSTRAP FAILURE: load_deps:Error: Interface "WP_Agent_Token_Store" not found at /wordpress/wp-content/plugins/data-machine/inc/Core/Database/Agents/AgentTokens.php:24`. The same failure reproduces on an unmodified `main` checkout (`git stash` and re-run gives identical output). CI runs the suite with the MySQL service-container path in `.github/workflows/homeboy.yml`, which boots cleanly. `php -l` is clean on every changed file, and `homeboy audit` reports informational outliers only: my code introduces no warning kinds that are not already present in `VenueMergeHelperTest` and `CheckMergeDuplicateVenuesCommand`.

Ops runbook (post-merge + deploy)

  1. `wp --url=events.extrachill.com data-machine-events check missing-venue-addresses --dry-run --limit=10` — sanity check on a small slice.
  2. `wp --url=events.extrachill.com data-machine-events check missing-venue-addresses --apply --limit=50` — first repair batch.
  3. `wp --url=events.extrachill.com data-machine-events check orphan-venues --dry-run` — preview orphan list.
  4. `wp --url=events.extrachill.com data-machine-events check orphan-venues --apply` — flag-only first pass.
  5. Review flagged terms via `_venue_orphan_flagged_at` meta. Decide which to delete.
  6. `wp --url=events.extrachill.com data-machine-events check orphan-venues --apply --delete-orphans` — delete confirmed orphans.

Not in this PR

  • Auto-deleting orphans by default (the issue is explicit on this).
  • Cross-pipeline address normalization for venues mapped to two cities (operator territory; file separately if it surfaces).
  • REST surface for these audits (CLI-only for v1).
  • The extrachill-events digest line additions (follow-up issue in that repo; this PR exposes the venue-stats ability the digest will consume).

cc <@532385681268408341>

Extra Chill Bot added 3 commits May 18, 2026 18:56
feat(check): add missing-venue-addresses audit + repair command

Part A of #277. Scans venue terms whose `_venue_address` meta is empty
or missing (262 of 3,765 on events.extrachill.com today, 7%) and
repairs them via two paths:

  1. Reverse-geocode from `_venue_coordinates` via the existing
     Nominatim surface. Smart-merges parsed components — address,
     city, state, zip, country — into empty meta fields only,
     mirroring VenueMergeHelper's fill-empties-only contract.

  2. Forward-search Nominatim with `{name} {city}` for venues that
     have a city but no coordinates. The top hit is gated by
     VenueMergeHelper::names_are_similar so we never silently rewrite
     "The Local Bar" → "Texas Music Theater" just because Nominatim
     returned a top match.

  3. Residue (no coords AND no city) is surfaced as
     `action=no_repair_possible` for operator review; no writes.

Default --dry-run; --apply required to commit. --limit defaults to 50
to keep single-run scope bounded for ops review. --format supports
table / csv / json. The two network paths are isolated in protected
methods so unit tests can stub them without hitting Nominatim.

Architectural note: this codebase uses Nominatim (OpenStreetMap) as
its only geocoding surface. The original issue mentioned Google
Places as the fallback strategy; that client does not exist here and
adding it would introduce a paid API-key dependency the network does
not provision. Reusing Nominatim's forward-search keeps the layer
agnostic and is filed in the PR body as a follow-up if Google
coverage proves materially better.

Tests cover dry-run/apply, reverse-geocode happy path, places-lookup
fallback, smart-merge non-overwrite, residue no-op, and the
name-similarity rejection.

feat(check): add orphan-venues audit + flag-not-delete command

Part B of #277. Scans venue terms whose `wp_term_taxonomy.count = 0`
(278 of 3,765 on events.extrachill.com today, 7%) and processes each
through a 4-step decision tree:

  1. VERIFY the count cache against the real `wp_term_relationships`
     join. If a real relationship exists, refresh the cache via
     `wp_update_term_count_now()` and skip the term — it is not
     actually orphaned, the cache just went stale.

  2. PROTECT terms referenced by an active flow. A `flow_config` JSON
     blob containing `"venue":"<term_id>"` (flat or nested
     handler_config shapes — same LIKE patterns
     VenueMergeHelper::reassign_flow_handler_configs uses) means the
     term is in active use even though it has zero events yet. We
     stamp `_venue_orphan_protected_by_flow = <flow_id>` and skip
     deletion.

  3. FLAG real orphans. The default action is FLAG-NOT-DELETE: stamp
     `_venue_orphan_flagged_at` with the current Unix timestamp and
     leave the term in place. Operator decides whether to delete
     later. Surface for visibility, not auto-destruction.

  4. DELETE only with --delete-orphans opt-in (even when --apply is
     set). Even with --delete-orphans, terms protected by
     `_venue_no_merge` (VenueMergeHelper::NO_MERGE_META_KEY) or a
     pre-existing `_venue_orphan_protected_by_flow` meta are kept.

Default --dry-run; --apply required to commit any writes;
--delete-orphans further required for deletion. --limit defaults to
100, --format supports table / csv / json.
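As an illustration of the PROTECT step, the check "does any active flow's config name this term" can be sketched in Python. Note the swap: the command reuses SQL LIKE patterns, whereas this sketch decodes the JSON and walks it, which is a different technique chosen only to show the flat and nested cases. The function names and the example config shapes are hypothetical.

```python
import json


def references_venue(node, term_id) -> bool:
    """True if a "venue" key anywhere in the decoded config equals term_id."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "venue" and str(value) == str(term_id):
                return True
            if references_venue(value, term_id):
                return True
        return False
    if isinstance(node, list):
        return any(references_venue(child, term_id) for child in node)
    return False


def protecting_flow_id(flows, term_id):
    """Return the id of the first flow whose config references the term, else None.

    `flows` is an iterable of (flow_id, raw_json_config) pairs.
    """
    for flow_id, raw_config in flows:
        try:
            config = json.loads(raw_config)
        except (TypeError, ValueError):
            continue  # skip missing or malformed configs
        if references_venue(config, term_id):
            return flow_id
    return None
```

A non-`None` result is what would be stamped into `_venue_orphan_protected_by_flow`.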

Tests cover dry-run no-op, stale-cache refresh, flow protection,
default flag-only behavior, --delete-orphans deletion, and the
two-flavor protection-from-deletion guard.

feat(digest): add data-machine-events/venue-stats ability

Part C of #277. Exposes a small read-only ability that returns three
counts plus a snapshot timestamp for the weekly qualify-digest's
trend lines:

  - no_address: venue terms whose `_venue_address` meta is empty/missing
  - orphans:    venue terms whose `wp_term_taxonomy.count = 0`
  - total:      total venue terms
  - queried_at: Unix timestamp at which the snapshot was taken

Implementation uses two aggregate SQL queries plus one LEFT JOIN
against termmeta, so the digest can call this weekly across multiple
sites without pulling every term into PHP. Permission gated to
`manage_options` or WP-CLI context.

The intended consumer is the qualify-digest ability in
extrachill-events (#79). The cross-plugin wiring (reading this
ability from the digest, formatting the two trend lines, and
persisting last-week's counts for the delta) is a follow-up issue in
that repo — this PR stays scoped to data-machine-events and ships
only the ability the digest will consume. Operators can run the
CLI commands from #277 Parts A/B directly in the meantime.

Test covers the response shape against a seeded fixture of 5 venue
terms split across the three categories.

homeboy-ci Bot commented May 18, 2026

Homeboy Results — data-machine-events

Audit

audit — passed

  • test_coverage — 3 finding(s)
  • intra-method-duplication — 2 finding(s)
  • requested_detectors — 2 finding(s)
  • dead_code — 1 finding(s)
  • duplication — 1 finding(s)
  • Total: 9 finding(s)

Deep dive: homeboy audit data-machine-events --changed-since abeb5de

Tooling versions
  • Homeboy CLI: homeboy 0.182.0+0e7aab9
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: 65942142
  • Action: Extra-Chill/homeboy-action@v2

@chubes4 chubes4 merged commit 1490548 into main May 19, 2026
1 check passed