feat(check): audit + repair venue terms with missing addresses or orphan status #283
Merged
added 3 commits
May 18, 2026 18:56
Part A of #277. Scans venue terms whose `_venue_address` meta is empty or missing (262 of 3,765 on events.extrachill.com today, 7%) and repairs them via two paths:

1. Reverse-geocode from `_venue_coordinates` via the existing Nominatim surface. Smart-merges parsed components (address, city, state, zip, country) into empty meta fields only, mirroring VenueMergeHelper's fill-empties-only contract.
2. Forward-search Nominatim with `{name} {city}` for venues that have a city but no coordinates. The top hit is gated by VenueMergeHelper::names_are_similar so we never silently rewrite "The Local Bar" → "Texas Music Theater" just because Nominatim returned a top match.

Residue (no coords AND no city) is surfaced as `action=no_repair_possible` for operator review; no writes.

Default --dry-run; --apply required to commit. --limit defaults to 50 to keep single-run scope bounded for ops review. --format supports table / csv / json. The two network paths are isolated in protected methods so unit tests can stub them without hitting Nominatim.

Architectural note: this codebase uses Nominatim (OpenStreetMap) as its only geocoding surface. The original issue mentioned Google Places as the fallback strategy; that client does not exist here, and adding it would introduce a paid API-key dependency the network does not provision. Reusing Nominatim's forward search keeps the layer provider-agnostic; Google Places is noted in the PR body as a possible follow-up if its coverage proves materially better.

Tests cover dry-run/apply, reverse-geocode happy path, places-lookup fallback, smart-merge non-overwrite, residue no-op, and the name-similarity rejection.
Part B of #277. Scans venue terms whose `wp_term_taxonomy.count = 0` (278 of 3,765 on events.extrachill.com today, 7%) and processes each through a 4-step decision tree:

1. VERIFY the count cache against the real `wp_term_relationships` join. If a real relationship exists, refresh the cache via `wp_update_term_count_now()` and skip the term: it is not actually orphaned, the cache just went stale.
2. PROTECT terms referenced by an active flow. A `flow_config` JSON blob containing `"venue":"<term_id>"` (flat or nested handler_config shapes, the same LIKE patterns VenueMergeHelper::reassign_flow_handler_configs uses) means the term is in active use even though it has zero events yet. We stamp `_venue_orphan_protected_by_flow = <flow_id>` and skip deletion.
3. FLAG real orphans. The default action is FLAG-NOT-DELETE: stamp `_venue_orphan_flagged_at` with the current Unix timestamp and leave the term in place. The operator decides whether to delete later. The goal is visibility, not auto-destruction.
4. DELETE only with the --delete-orphans opt-in; --apply alone never deletes. Even with --delete-orphans, terms protected by `_venue_no_merge` (VenueMergeHelper::NO_MERGE_META_KEY) or a pre-existing `_venue_orphan_protected_by_flow` meta are kept.

Default --dry-run; --apply required to commit any writes; --delete-orphans further required for deletion. --limit defaults to 100, --format supports table / csv / json.

Tests cover dry-run no-op, stale-cache refresh, flow protection, default flag-only behavior, --delete-orphans deletion, and the two-flavor protection-from-deletion guard.
Part C of #277. Exposes a small read-only ability that returns three counts plus a snapshot timestamp for the weekly qualify-digest's trend lines:

- no_address: venue terms whose `_venue_address` meta is empty/missing
- orphans: venue terms whose `wp_term_taxonomy.count = 0`
- total: total venue terms
- queried_at: Unix timestamp at which the snapshot was taken

Implementation uses two aggregate SQL queries plus one LEFT JOIN against termmeta, so the digest can call this weekly across multiple sites without pulling every term into PHP. Permission is gated to `manage_options` or WP-CLI context.

The intended consumer is the qualify-digest ability in extrachill-events (#79). The cross-plugin wiring (reading this ability from the digest, formatting the two trend lines, and persisting last week's counts for the delta) is a follow-up issue in that repo. This PR stays scoped to data-machine-events and ships only the ability the digest will consume. Operators can run the CLI commands from #277 Parts A/B directly in the meantime.

The test covers the response shape against a seeded fixture of 5 venue terms split across the three categories.
Fixes #277.
Production evidence (events.extrachill.com, blog 7)
Today's network-wide venue audit found:
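- 262 of 3,765 venue terms (7%) have empty or missing `_venue_address` meta
- 278 of 3,765 venue terms (7%) have zero events (`wp_term_taxonomy.count = 0`)

Both classes are invisible to existing dedup / qualify paths:
- ... `find_venue_by_address`, so every batch discovery run creates a fresh duplicate term.

What this PR ships (3 commits)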
Part A — `wp data-machine-events check missing-venue-addresses`
Commit `feat(check): add missing-venue-addresses audit + repair command`.
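Walks every venue term whose `_venue_address` is empty/missing and tries, in order:
1. Reverse-geocode from `_venue_coordinates` via the existing Nominatim surface, merging parsed components (address, city, state, zip, country) into empty meta fields only.
2. Forward-search Nominatim with `{name} {city}` for venues that have a city but no coordinates, accepting the top hit only if it passes `VenueMergeHelper::names_are_similar`.
3. Otherwise (no coordinates and no city), report `action=no_repair_possible` for operator review; no writes.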
Default `--dry-run`; `--apply` required to commit. `--limit` (default 50), `--format=table|csv|json`. The network paths are isolated in protected methods so unit tests stub them without hitting Nominatim.
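The fill-empties-only merge and the name-similarity gate for Part A can be sketched as two small pure functions. This is a minimal illustration, not the plugin's code: `fill_empty_fields()` and `names_look_similar()` are hypothetical names, the field keys are assumed, and the real `VenueMergeHelper::names_are_similar` may apply different rules and thresholds.

```php
<?php

/**
 * Hypothetical helper: copy geocoded components into EMPTY fields only.
 * A field that already holds a non-empty value is never overwritten.
 */
function fill_empty_fields(array $existing, array $geocoded): array
{
    $fields = ['address', 'city', 'state', 'zip', 'country']; // assumed keys
    $merged = $existing;

    foreach ($fields as $field) {
        $current  = trim((string) ($existing[$field] ?? ''));
        $incoming = trim((string) ($geocoded[$field] ?? ''));

        if ($current === '' && $incoming !== '') {
            $merged[$field] = $incoming;
        }
    }

    return $merged;
}

/**
 * Illustrative stand-in for a name-similarity gate (assumption: a simple
 * similar_text() percentage on normalized names, with an arbitrary threshold).
 */
function names_look_similar(string $a, string $b, float $threshold = 80.0): bool
{
    $normalize = static function (string $s): string {
        // Lowercase, then collapse every non-alphanumeric run to one space.
        return trim((string) preg_replace('/[^a-z0-9]+/', ' ', strtolower($s)));
    };

    similar_text($normalize($a), $normalize($b), $percent);

    return $percent >= $threshold;
}
```

In this sketch a forward-search hit would only be passed to `fill_empty_fields()` after `names_look_similar()` accepts it, so a mismatched top result such as "Texas Music Theater" for "The Local Bar" is rejected before any meta is written.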
Part B — `wp data-machine-events check orphan-venues`
Commit `feat(check): add orphan-venues audit + flag-not-delete command`.
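For every venue term where `wp_term_taxonomy.count = 0`:
1. Verify the count cache against the real `wp_term_relationships` join. If a relationship exists, refresh the cache via `wp_update_term_count_now()` and skip the term.
2. Protect terms referenced by an active flow's `flow_config`: stamp `_venue_orphan_protected_by_flow = <flow_id>` and skip deletion.
3. Flag real orphans by stamping `_venue_orphan_flagged_at` with the current Unix timestamp; the term stays in place.
4. Delete only with `--delete-orphans`. Terms carrying `_venue_no_merge` or a pre-existing `_venue_orphan_protected_by_flow` meta are kept even then.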
Default `--dry-run`; `--apply` required; `--delete-orphans` further required for deletion. `--limit` (default 100), `--format=table|csv|json`.
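The four-step orphan decision tree can be expressed as one pure function, which makes the precedence of the checks explicit. This is a sketch under assumptions: `decide_orphan_action()`, the input keys, and the returned action labels are all hypothetical, and the source does not say whether a protected term is also flagged when `--delete-orphans` is passed, so the sketch simply reports it as kept.

```php
<?php

/**
 * Hypothetical decision function for one venue term whose cached count is 0.
 *
 * Assumed $term keys (none of these names come from the plugin):
 *   real_relationships  int       rows actually found in wp_term_relationships
 *   flow_id             int|null  flow whose flow_config references this term
 *   no_merge            bool      `_venue_no_merge` meta is present
 *   flow_protected      bool      `_venue_orphan_protected_by_flow` already set
 */
function decide_orphan_action(array $term, bool $delete_orphans): string
{
    // Step 1: the cache was stale; the term is not really an orphan.
    if (($term['real_relationships'] ?? 0) > 0) {
        return 'refresh_count';
    }

    // Step 2: an active flow references the term, so protect it.
    if (($term['flow_id'] ?? null) !== null) {
        return 'protect_by_flow';
    }

    // Step 4 guard: protected terms survive even with --delete-orphans.
    $protected = !empty($term['no_merge']) || !empty($term['flow_protected']);

    if ($delete_orphans) {
        return $protected ? 'keep_protected' : 'delete';
    }

    // Step 3: default behavior is flag, not delete.
    return 'flag';
}
```

The function only names the action. In the command as described, nothing is written unless `--apply` is set, so a dry run would report these labels without stamping meta or deleting terms.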
Part C — `data-machine-events/venue-stats` ability
Commit `feat(digest): add data-machine-events/venue-stats ability`.
Read-only ability returning `{ no_address: int, orphans: int, total: int, queried_at: int }`. Two aggregate SQL queries + one LEFT JOIN so the weekly qualify-digest can call it across multiple sites cheaply.
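To pin down what each field means, here is an in-memory equivalent of the counts. It is only a sketch of the semantics and response shape: the PR computes these with SQL aggregates precisely to avoid loading terms into PHP, and `venue_stats()` and the row keys are hypothetical.

```php
<?php

/**
 * Hypothetical in-memory version of the venue-stats counts.
 *
 * Each row is assumed to look like:
 *   ['count' => int, 'address' => string|null]
 * where `count` mirrors wp_term_taxonomy.count and `address` mirrors the
 * `_venue_address` meta value (null when the meta row is missing).
 */
function venue_stats(array $rows, int $now): array
{
    $no_address = 0;
    $orphans    = 0;

    foreach ($rows as $row) {
        // Empty string and missing meta both count as "no address".
        if (trim((string) ($row['address'] ?? '')) === '') {
            $no_address++;
        }
        if ((int) ($row['count'] ?? 0) === 0) {
            $orphans++;
        }
    }

    return [
        'no_address' => $no_address,
        'orphans'    => $orphans,
        'total'      => count($rows),
        'queried_at' => $now, // Unix timestamp of the snapshot
    ];
}
```

Note that the two categories are not exclusive: a term with no address and no events is counted in both `no_address` and `orphans`, and once in `total`.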
Cross-plugin wiring is a follow-up. Per the issue test plan I considered shipping the digest-line additions in extrachill-events here too, but landing both in one PR violates the one-repo-per-PR rule and forces a cross-plugin review. Instead this PR exposes the ability and the digest wiring (reading the ability, formatting the two trend lines, persisting last-week's counts for the delta) lands in a follow-up issue against `extrachill-events`. Operators can run the CLI commands directly in the meantime.
Architectural notes / choices
Tests
13 new tests across three files. All match the existing `VenueMergeHelperTest` pattern (`WP_UnitTestCase` + `wp_insert_term` against the real WP test DB).
`tests/Unit/CheckMissingVenueAddressesCommandTest.php`:
`tests/Unit/CheckOrphanVenuesCommandTest.php`:
`tests/Unit/VenueStatsAbilitiesTest.php`:
The network surfaces (Nominatim reverse-geocode and search) are stubbed in tests via a small subclass that overrides the protected `reverse_geocode()` / `places_lookup()` methods — no network calls during test runs.
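The stubbing pattern looks roughly like the following. Only the method names `reverse_geocode()` and `places_lookup()` and the `no_repair_possible` action come from this PR; the class names, method signatures, input keys, and other action labels are assumptions made for the sketch.

```php
<?php

// Hypothetical stand-in for the command class. The two protected methods
// are the only places that would touch the network.
class MissingAddressRepairer
{
    public function repair(array $venue): array
    {
        if (!empty($venue['coordinates'])) {
            $hit = $this->reverse_geocode($venue['coordinates']);
            return ['action' => $hit ? 'reverse_geocoded' : 'no_match', 'data' => $hit];
        }

        if (!empty($venue['city'])) {
            $hit = $this->places_lookup($venue['name'] . ' ' . $venue['city']);
            return ['action' => $hit ? 'places_lookup' : 'no_match', 'data' => $hit];
        }

        return ['action' => 'no_repair_possible', 'data' => null];
    }

    protected function reverse_geocode(string $coordinates): ?array
    {
        throw new RuntimeException('Network call; not available in this sketch.');
    }

    protected function places_lookup(string $query): ?array
    {
        throw new RuntimeException('Network call; not available in this sketch.');
    }
}

// Test double: overrides the protected methods and records each call.
class StubbedRepairer extends MissingAddressRepairer
{
    public array $calls = [];

    protected function reverse_geocode(string $coordinates): ?array
    {
        $this->calls[] = 'reverse:' . $coordinates;
        return ['address' => '123 Main St', 'city' => 'Austin'];
    }

    protected function places_lookup(string $query): ?array
    {
        $this->calls[] = 'search:' . $query;
        return null; // simulate "no acceptable hit"
    }
}
```

Because the overrides live in a subclass, a test can assert both on the returned action and on which network path would have been taken, without any HTTP mocking layer.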
Local test runner caveat
I attempted to run the full suite locally via `homeboy test --extension wordpress` and hit a pre-existing bootstrap failure unrelated to this PR: `BOOTSTRAP FAILURE: load_deps:Error: Interface "WP_Agent_Token_Store" not found at /wordpress/wp-content/plugins/data-machine/inc/Core/Database/Agents/AgentTokens.php:24`. The same failure reproduces on an unmodified `main` checkout (`git stash` and re-run gives identical output). CI runs the suite with the MySQL service-container path in `.github/workflows/homeboy.yml`, which boots cleanly. `php -l` is clean on every changed file, and `homeboy audit` reports informational outliers only: my code introduces no warning kinds that are not already present in `VenueMergeHelperTest` and `CheckMergeDuplicateVenuesCommand`.
Ops runbook (post-merge + deploy)
Not in this PR
cc <@532385681268408341>