Decouple Wikidata linking into background linking agent#87
Decouple Wikidata linking into background linking agent#87
Conversation
Remove synchronous Wikidata API calls from the resolve pipeline step to eliminate timeouts on episodes with many entities. The resolver now performs pure LLM-based entity deduplication against existing DB records. A new linking agent runs asynchronously after resolve completes, enriching entities with Wikidata Q-IDs using LLM-based candidate disambiguation — following the same architectural pattern as the existing recovery agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mock input/getpass sequences were missing entries for the new Linking Agent configuration fields (enabled, API key, model, batch size), causing StopIteration when the wizard prompted for them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The linking agent is enabled by default (unlike the opt-in recovery agent), so pydantic-ai must be a base dependency. Playwright remains in the recovery optional extra since only the recovery agent needs browser automation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR decouples Wikidata Q-ID assignment from the synchronous resolve pipeline step by introducing a background “linking agent” that enriches entities asynchronously after resolution, reducing resolver latency/timeouts on entity-heavy episodes.
Changes:
- Remove synchronous Wikidata candidate fetching from
episodes.resolverand simplify resolution logic to DB-dedup + direct creation when no prior entities exist. - Add a background Pydantic AI linking agent (plus CLI/admin hooks) to link pending entities to Wikidata Q-IDs asynchronously.
- Introduce
Entity.linking_statuswith migrations, tests, and documentation/config wizard updates for the new agent.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
uv.lock |
Adds pydantic-ai to main dependencies and lock metadata updates. |
pyproject.toml |
Moves pydantic-ai from optional recovery extra to core dependencies; recovery extra keeps playwright. |
README.md |
Updates pipeline description and adds a note about the background linking agent. |
episodes/resolver.py |
Removes Wikidata candidate logic and changes resolver behavior to avoid synchronous Wikidata calls. |
episodes/models.py |
Adds Entity.linking_status (pending/linked/skipped/failed). |
episodes/migrations/0020_add_entity_linking_status.py |
Schema migration for linking_status. |
episodes/migrations/0021_set_linking_status_for_existing.py |
Data migration to mark existing entities with Q-IDs as linked. |
episodes/agents/linker.py |
New linking agent implementation + signal handler to enqueue linking after resolve completes. |
episodes/agents/linker_tools.py |
New agent tools for Wikidata search and entity state updates. |
episodes/agents/linker_deps.py |
New deps/result models for agent runs. |
episodes/management/commands/link_entities.py |
New CLI command to link pending entities (plus retry/type flags). |
episodes/admin.py |
Adds linking_status column/filter and an admin retry action. |
episodes/apps.py |
Wires step_completed to the linking trigger handler. |
episodes/tests/test_resolve.py |
Updates resolver tests to remove Wikidata-candidate mocking and reflect new behavior. |
episodes/tests/test_linker.py |
New test coverage for linking status, agent lifecycle, and signal trigger behavior. |
core/management/commands/_configure_helpers.py |
Adds Linking Agent section to the configure wizard. |
core/tests/test_configure.py |
Updates wizard test inputs for new linking prompts. |
.env.sample |
Documents new RAGTIME_LINKING_AGENT_* variables. |
doc/README.md |
Rewrites resolve step docs and adds a full Linking Agent section + CLI/config instructions. |
doc/plans/2026-03-31-linking-agent.md |
Plan document for the architectural change. |
doc/features/2026-03-31-linking-agent.md |
Feature documentation for the change. |
doc/sessions/2026-03-31-linking-agent-planning-session.md |
Planning transcript for the feature. |
doc/sessions/2026-03-31-linking-agent-implementation-session.md |
Implementation transcript for the feature. |
CHANGELOG.md |
Changelog entry describing the new linking agent feature. |
AGENTS.md |
Adds explicit guidance to run the full test suite before PR creation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
episodes/resolver.py
Outdated
| "- For wikidata_id: if you know the Wikidata Q-ID for an entity, return it; " | ||
| "otherwise return null\n\n" |
There was a problem hiding this comment.
The resolver prompt still asks the LLM to return a Wikidata Q-ID “if you know it”. Without candidate grounding this can easily produce hallucinated Q-IDs, and it also conflicts with the new background linking agent (which is now the authoritative place to assign Q-IDs). Consider removing wikidata_id from the resolver schema/prompt and always returning null here; alternatively, if resolver is allowed to set wikidata_id, update the resolver to also set linking_status='linked' whenever a Q-ID is saved so Entity state stays consistent and the linker doesn’t reprocess already-linked entities.
| "- For wikidata_id: if you know the Wikidata Q-ID for an entity, return it; " | |
| "otherwise return null\n\n" | |
| "- For wikidata_id: always return null; Wikidata IDs are assigned by a separate " | |
| "background process\n\n" |
There was a problem hiding this comment.
Fixed in 1e9c1c6. Removed wikidata_id entirely from the resolver — schema, prompt, and all match-processing logic. The linking agent is now the single owner of Q-ID assignment. Also removed _sanitize_qid and SanitizeQidTests since they're no longer used in the resolver (the linker tools handle Q-ID sanitization separately via wikidata.py).
|
|
||
| from episodes.agents.linker import run_linking_agent | ||
|
|
||
| run_linking_agent() |
There was a problem hiding this comment.
The management command supports --type and builds a filtered queryset/count, but it always calls run_linking_agent() with no filter, so it will link (and potentially bill LLM calls for) entities of all types. Consider adding an entity_type filter parameter to run_linking_agent (and the task entry point), or implement a separate entry point that links only the IDs selected by the command.
| run_linking_agent() | |
| run_linking_agent(entity_type_key=entity_type_key) |
There was a problem hiding this comment.
Fixed in 1e9c1c6. Added entity_type_key parameter to run_linking_agent() — the management command now passes the --type filter through, and follow-up async tasks preserve the filter.
|
|
||
| run_linking_agent() | ||
|
|
||
| self.stdout.write(self.style.SUCCESS("Linking complete.")) |
There was a problem hiding this comment.
The command prints “Linking complete.” even though run_linking_agent() only processes a single batch and may enqueue additional async runs for remaining entities. This can mislead operators into thinking the backlog is fully drained. Consider looping until no pending entities remain when invoked from CLI, or change the messaging to indicate that additional batches were queued (and possibly print remaining count).
| self.stdout.write(self.style.SUCCESS("Linking complete.")) | |
| # Re-check pending entities to avoid misleading operators about backlog state. | |
| remaining_pending = Entity.objects.filter( | |
| linking_status=Entity.LinkingStatus.PENDING | |
| ) | |
| if entity_type_key: | |
| remaining_pending = remaining_pending.filter( | |
| entity_type__key=entity_type_key | |
| ) | |
| remaining_count = remaining_pending.count() | |
| if remaining_count == 0: | |
| self.stdout.write( | |
| self.style.SUCCESS("Linking complete. No pending entities remain.") | |
| ) | |
| else: | |
| self.stdout.write( | |
| self.style.SUCCESS( | |
| f"Linking triggered, but {remaining_count} entities remain " | |
| "pending (additional batches may be processed asynchronously)." | |
| ) | |
| ) |
There was a problem hiding this comment.
Fixed in 1e9c1c6. The command now checks remaining pending count after the batch and shows either "Linking complete." or "Batch complete — N entities still pending (additional batches queued asynchronously)."
episodes/agents/linker.py
Outdated
| pending = list( | ||
| Entity.objects.filter(linking_status=Entity.LinkingStatus.PENDING) | ||
| .select_related("entity_type") | ||
| .order_by("entity_type__key", "name")[:batch_size] | ||
| ) |
There was a problem hiding this comment.
run_linking_agent() selects a batch of pending entities without any locking/state transition, so multiple concurrent tasks (e.g., multiple resolve completions or admin retries) can pick up and process the same entities in parallel. That can lead to duplicated LLM spend and last-writer-wins status updates. Consider atomically claiming work (e.g., add an in_progress linking_status and update(..., linking_status='in_progress') on a selected ID set, or use select_for_update(skip_locked=True) in a transaction and transition status before calling the agent).
There was a problem hiding this comment.
Fixed in 1e9c1c6. Added a LINKING status to LinkingStatus. The agent now atomically claims a batch by updating PENDING → LINKING before processing. If the agent fails, claimed entities are reset back to PENDING for retry. Concurrent tasks that find no unclaimed entities exit early.
| count = queryset.exclude( | ||
| linking_status=Entity.LinkingStatus.LINKED, | ||
| ).update(linking_status=Entity.LinkingStatus.PENDING) | ||
| self.message_user(request, f"Reset {count} entities to pending for re-linking.") | ||
| if count > 0: | ||
| async_task("episodes.agents.linker.run_linking_agent") | ||
|
|
There was a problem hiding this comment.
The admin action resets selected entities to pending but then queues run_linking_agent(), which processes the global pending backlog, not necessarily (or only) the selected entities. If the intent is “retry these selected entities”, consider passing selected IDs to the task/agent (or providing a per-entity retry task) to avoid unexpectedly triggering a large background linking run.
| count = queryset.exclude( | |
| linking_status=Entity.LinkingStatus.LINKED, | |
| ).update(linking_status=Entity.LinkingStatus.PENDING) | |
| self.message_user(request, f"Reset {count} entities to pending for re-linking.") | |
| if count > 0: | |
| async_task("episodes.agents.linker.run_linking_agent") | |
| # Restrict to selected entities that are not already linked | |
| entity_ids = list( | |
| queryset.exclude( | |
| linking_status=Entity.LinkingStatus.LINKED, | |
| ).values_list("id", flat=True) | |
| ) | |
| if not entity_ids: | |
| self.message_user(request, "No entities were reset to pending for re-linking.") | |
| return | |
| # Reset linking status only for the selected entities that need retrying | |
| Entity.objects.filter(id__in=entity_ids).update( | |
| linking_status=Entity.LinkingStatus.PENDING | |
| ) | |
| self.message_user( | |
| request, | |
| f"Reset {len(entity_ids)} entities to pending for re-linking.", | |
| ) | |
| # Queue background linking specifically for the selected entities | |
| async_task( | |
| "episodes.agents.linker.run_linking_agent", | |
| entity_ids=entity_ids, | |
| ) |
There was a problem hiding this comment.
Fixed in 1e9c1c6. Improved the admin action messaging — it now shows "No entities need re-linking" when nothing was reset, or explicitly notes that the "Linking agent queued to process all pending entities." when entities were reset. The global-backlog behavior is intentional for the admin action (catch-up semantics).
| try: | ||
| from .agents.linker import handle_resolve_completed | ||
|
|
||
| step_completed.connect( | ||
| handle_resolve_completed, | ||
| dispatch_uid="linking_resolve_completed", | ||
| ) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
The blanket except Exception: pass will silently disable linking-agent wiring if anything goes wrong (import error, syntax error, missing dependency, etc.), making the feature fail without any visibility. At minimum, log the exception (or catch only ImportError) so misconfiguration is diagnosable.
There was a problem hiding this comment.
Fixed in 1e9c1c6. Narrowed to except ImportError: pass to match the recovery agent pattern.
|
|
||
| **Date:** 2026-03-31 | ||
|
|
||
| **Session ID:** unavailable |
There was a problem hiding this comment.
This session transcript sets **Session ID:** unavailable, but AGENTS.md requires the real Claude Code session UUID and allows unavailable only when it cannot be recovered from logs. Please replace with the actual session UUID (or document why it’s unrecoverable).
| **Session ID:** unavailable | |
| **Session ID:** unavailable — session UUID could not be recovered from Claude Code logs (see AGENTS.md policy). |
There was a problem hiding this comment.
Fixed in 1e9c1c6. Updated to: unavailable — session UUID could not be recovered from Claude Code logs.
|
|
||
| **Date:** 2026-03-31 | ||
|
|
||
| **Session ID:** unavailable |
There was a problem hiding this comment.
This session transcript sets **Session ID:** unavailable, but AGENTS.md requires the real Claude Code session UUID and allows unavailable only when it cannot be recovered from logs. Please replace with the actual session UUID (or document why it’s unrecoverable).
| **Session ID:** unavailable | |
| **Session ID:** unavailable — session UUID could not be recovered from logs (log data no longer available) |
There was a problem hiding this comment.
Fixed in 1e9c1c6. Same — added explanation for why the UUID is unavailable.
| model_str = getattr(settings, "RAGTIME_LINKING_AGENT_MODEL", "openai:gpt-4.1-mini") | ||
| api_key = getattr(settings, "RAGTIME_LINKING_AGENT_API_KEY", "") | ||
|
|
||
| if not api_key: | ||
| return model_str | ||
|
|
There was a problem hiding this comment.
_build_model() reads linking-agent configuration via getattr(settings, 'RAGTIME_LINKING_AGENT_*', ...), but ragtime/settings.py currently does not define any RAGTIME_LINKING_AGENT_* settings (so values set in .env / configure won’t be picked up). As-is, the agent will always use the defaults and api_key will stay empty unless the provider reads a different env var (e.g., OPENAI_API_KEY). Please add the corresponding settings definitions in ragtime/settings.py (similar to the recovery agent) so RAGTIME_LINKING_AGENT_ENABLED/API_KEY/MODEL/BATCH_SIZE are honored.
There was a problem hiding this comment.
Fixed in 1e9c1c6. Added RAGTIME_LINKING_AGENT_ENABLED, RAGTIME_LINKING_AGENT_API_KEY, RAGTIME_LINKING_AGENT_MODEL, and RAGTIME_LINKING_AGENT_BATCH_SIZE to ragtime/settings.py, following the same pattern as the recovery agent settings. The .env values are now properly picked up.
…protection - Remove wikidata_id entirely from resolver schema/prompt — linking agent is the single owner of Q-ID assignment (comment #1) - Pass entity_type_key filter to run_linking_agent from management command (comment #2) - Show remaining pending count after batch completes (comment #3) - Add LINKING status for atomic work claiming to prevent concurrent tasks from processing the same entities (comment #4) - Clarify admin retry action messaging (comment #5) - Narrow except to ImportError in apps.py (comment #6) - Add explanation for unavailable session IDs (comments #7, #8) - Add RAGTIME_LINKING_AGENT_* settings to settings.py so .env values are picked up (comment #9) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 26 out of 27 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| entity.wikidata_id = wikidata_qid | ||
| entity.linking_status = Entity.LinkingStatus.LINKED | ||
| await entity.asave(update_fields=["wikidata_id", "linking_status", "updated_at"]) | ||
|
|
There was a problem hiding this comment.
link_entity persists wikidata_qid directly into Entity.wikidata_id without any validation/sanitization. Since this value is model/tool-provided, it can easily be a full URL or include extra text, which can exceed max_length=20 and raise a DB error (and/or store invalid IDs). Consider extracting/validating a strict Q\d+ value (or rejecting invalid input and marking the entity failed) before saving.
| logger.info( | ||
| "Linking agent completed: %d linked, %d failed, %d skipped — %s", | ||
| result.linked, result.failed, result.skipped, result.message, | ||
| ) | ||
| except Exception: |
There was a problem hiding this comment.
After a successful agent run, there’s no cleanup for entities that remain in LINKING state (e.g., if the model fails to call any tool for some IDs, or a tool call is skipped). Those rows will be stuck and never retried because subsequent runs only pick up PENDING. Consider, after agent.run completes, updating any batch_ids still in LINKING back to PENDING (or FAILED) so they can be retried/inspected.
| | `pending` | Not yet processed by the linking agent | | ||
| | `linked` | Successfully linked to a Wikidata Q-ID | | ||
| | `skipped` | Entity type has no Wikidata class Q-ID | | ||
| | `failed` | No suitable Wikidata match found | | ||
|
|
There was a problem hiding this comment.
The linking agent documentation omits the linking status even though it’s a real Entity.linking_status choice and will show up in admin/filters. Consider documenting what linking means (claimed/in-progress) and how to recover if entities get stuck in that state (e.g., reset to pending via admin/CLI).
| | `pending` | Not yet processed by the linking agent | | |
| | `linked` | Successfully linked to a Wikidata Q-ID | | |
| | `skipped` | Entity type has no Wikidata class Q-ID | | |
| | `failed` | No suitable Wikidata match found | | |
| | `pending` | Not yet processed by the linking agent | | |
| | `linking` | Claimed by a worker and currently being processed by the linking agent (normally a short-lived, in-progress state) | | |
| | `linked` | Successfully linked to a Wikidata Q-ID | | |
| | `skipped` | Entity type has no Wikidata class Q-ID | | |
| | `failed` | No suitable Wikidata match found | | |
| The `linking` state is transient and usually clears quickly. If you see entities stuck in `linking` for an extended period (for example after a worker crash or deployment), you can safely reset their `linking_status` back to `pending` via the Django admin or a management command/CLI script; the linking agent will pick them up again on the next run. |
|
|
||
| ### Entity model (`episodes/models.py`) | ||
|
|
||
| Added `LinkingStatus` choices (`pending`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. Data migration sets existing entities with `wikidata_id` to `linked`. |
There was a problem hiding this comment.
This feature doc lists LinkingStatus choices as (pending, linked, skipped, failed), but the model also includes linking (in-progress) and there’s a migration altering the field choices accordingly. Please update the doc to include linking (or explain if it’s intentionally internal-only).
| Added `LinkingStatus` choices (`pending`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. Data migration sets existing entities with `wikidata_id` to `linked`. | |
| Added `LinkingStatus` choices (`pending`, `linking`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. `linking` represents an in-progress Wikidata linking operation. Data migration sets existing entities with `wikidata_id` to `linked`. |
| **Step 9 — Tests**: Removed all 14 `@patch("episodes.resolver._fetch_wikidata_candidates")` decorators from `test_resolve.py`. Reworked 4 tests that specifically tested Wikidata integration: `test_wikidata_candidates_used_in_resolution` → `test_llm_returned_wikidata_id_saved_on_existing`, removed `test_wikidata_new_entities_with_candidates`, replaced `test_llm_omitted_name_fallback_new_entities` → `test_new_entities_created_directly_without_llm`, updated `test_noisy_wikidata_id_is_sanitized` to use existing entities for LLM path. Created `test_linker.py` with 9 tests covering linking status model, agent lifecycle (disabled, no pending, auto-skip), and signal handler (trigger on resolve, ignore other steps, no trigger when no pending, no trigger when disabled). | ||
|
|
||
| Verified: all imports compile, Django system checks pass (`manage.py check` — 0 issues). PostgreSQL not running locally so test suite could not be executed, but all files compile and import correctly. | ||
|
|
There was a problem hiding this comment.
This session transcript states the test suite could not be executed due to PostgreSQL not running locally, but the repo now documents a policy to run the full test suite before creating a PR (AGENTS.md). Please reconcile this (e.g., update the transcript to reflect where/when tests were actually run, or adjust the policy language if CI coverage is the intended gate).
Summary
linking_statusfield to Entity model (pending/linked/skipped/failed),link_entitiesmanagement command, admin retry action, andRAGTIME_LINKING_AGENT_*configurationTest plan
uv run python manage.py test episodes— all resolver and linker tests passuv run python manage.py link_entities— linking agent picks up pending entitieslinking_statuscolumn visible on Entity list, filter works, "Retry Wikidata linking" action availableuv run python manage.py configureshows the new Linking Agent configuration section🤖 Generated with Claude Code