Decouple Wikidata linking into background linking agent#87

Open
rafacm wants to merge 7 commits into main from feature/linking-agent

@rafacm (Owner) commented Mar 31, 2026

Summary

  • Remove synchronous Wikidata API calls from the resolve pipeline step to eliminate timeouts on episodes with many entities
  • Create a background linking agent (Pydantic AI) that asynchronously enriches entities with Wikidata Q-IDs after the pipeline completes — following the same architectural pattern as the existing recovery agent
  • Add linking_status field to Entity model (pending/linked/skipped/failed), link_entities management command, admin retry action, and RAGTIME_LINKING_AGENT_* configuration

Test plan

  • Start PostgreSQL and run uv run python manage.py test episodes — all resolver and linker tests pass
  • Process an episode end-to-end — resolve step completes fast without Wikidata API calls
  • Run uv run python manage.py link_entities — linking agent picks up pending entities
  • Check Django admin — linking_status column visible on Entity list, filter works, "Retry Wikidata linking" action available
  • Verify uv run python manage.py configure shows the new Linking Agent configuration section

🤖 Generated with Claude Code

rafacm and others added 6 commits March 31, 2026 09:08
Remove synchronous Wikidata API calls from the resolve pipeline step
to eliminate timeouts on episodes with many entities. The resolver now
performs pure LLM-based entity deduplication against existing DB
records. A new linking agent runs asynchronously after resolve
completes, enriching entities with Wikidata Q-IDs using LLM-based
candidate disambiguation — following the same architectural pattern
as the existing recovery agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mock input/getpass sequences were missing entries for the new
Linking Agent configuration fields (enabled, API key, model, batch
size), causing StopIteration when the wizard prompted for them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The linking agent is enabled by default (unlike the opt-in recovery
agent), so pydantic-ai must be a base dependency. Playwright remains
in the recovery optional extra since only the recovery agent needs
browser automation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR decouples Wikidata Q-ID assignment from the synchronous resolve pipeline step by introducing a background “linking agent” that enriches entities asynchronously after resolution, reducing resolver latency/timeouts on entity-heavy episodes.

Changes:

  • Remove synchronous Wikidata candidate fetching from episodes.resolver and simplify resolution logic to DB-dedup + direct creation when no prior entities exist.
  • Add a background Pydantic AI linking agent (plus CLI/admin hooks) to link pending entities to Wikidata Q-IDs asynchronously.
  • Introduce Entity.linking_status with migrations, tests, and documentation/config wizard updates for the new agent.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds pydantic-ai to main dependencies and lock metadata updates. |
| pyproject.toml | Moves pydantic-ai from optional recovery extra to core dependencies; recovery extra keeps playwright. |
| README.md | Updates pipeline description and adds a note about the background linking agent. |
| episodes/resolver.py | Removes Wikidata candidate logic and changes resolver behavior to avoid synchronous Wikidata calls. |
| episodes/models.py | Adds Entity.linking_status (pending/linked/skipped/failed). |
| episodes/migrations/0020_add_entity_linking_status.py | Schema migration for linking_status. |
| episodes/migrations/0021_set_linking_status_for_existing.py | Data migration to mark existing entities with Q-IDs as linked. |
| episodes/agents/linker.py | New linking agent implementation + signal handler to enqueue linking after resolve completes. |
| episodes/agents/linker_tools.py | New agent tools for Wikidata search and entity state updates. |
| episodes/agents/linker_deps.py | New deps/result models for agent runs. |
| episodes/management/commands/link_entities.py | New CLI command to link pending entities (plus retry/type flags). |
| episodes/admin.py | Adds linking_status column/filter and an admin retry action. |
| episodes/apps.py | Wires step_completed to the linking trigger handler. |
| episodes/tests/test_resolve.py | Updates resolver tests to remove Wikidata-candidate mocking and reflect new behavior. |
| episodes/tests/test_linker.py | New test coverage for linking status, agent lifecycle, and signal trigger behavior. |
| core/management/commands/_configure_helpers.py | Adds Linking Agent section to the configure wizard. |
| core/tests/test_configure.py | Updates wizard test inputs for new linking prompts. |
| .env.sample | Documents new RAGTIME_LINKING_AGENT_* variables. |
| doc/README.md | Rewrites resolve step docs and adds a full Linking Agent section + CLI/config instructions. |
| doc/plans/2026-03-31-linking-agent.md | Plan document for the architectural change. |
| doc/features/2026-03-31-linking-agent.md | Feature documentation for the change. |
| doc/sessions/2026-03-31-linking-agent-planning-session.md | Planning transcript for the feature. |
| doc/sessions/2026-03-31-linking-agent-implementation-session.md | Implementation transcript for the feature. |
| CHANGELOG.md | Changelog entry describing the new linking agent feature. |
| AGENTS.md | Adds explicit guidance to run the full test suite before PR creation. |

Comment on lines +82 to +83
"- For wikidata_id: if you know the Wikidata Q-ID for an entity, return it; "
"otherwise return null\n\n"

Copilot AI Mar 31, 2026

The resolver prompt still asks the LLM to return a Wikidata Q-ID “if you know it”. Without candidate grounding this can easily produce hallucinated Q-IDs, and it also conflicts with the new background linking agent (which is now the authoritative place to assign Q-IDs). Consider removing wikidata_id from the resolver schema/prompt and always returning null here; alternatively, if resolver is allowed to set wikidata_id, update the resolver to also set linking_status='linked' whenever a Q-ID is saved so Entity state stays consistent and the linker doesn’t reprocess already-linked entities.

Suggested change
- "- For wikidata_id: if you know the Wikidata Q-ID for an entity, return it; "
- "otherwise return null\n\n"
+ "- For wikidata_id: always return null; Wikidata IDs are assigned by a separate "
+ "background process\n\n"

Owner Author

Fixed in 1e9c1c6. Removed wikidata_id entirely from the resolver — schema, prompt, and all match-processing logic. The linking agent is now the single owner of Q-ID assignment. Also removed _sanitize_qid and SanitizeQidTests since they're no longer used in the resolver (the linker tools handle Q-ID sanitization separately via wikidata.py).


from episodes.agents.linker import run_linking_agent

run_linking_agent()

Copilot AI Mar 31, 2026

The management command supports --type and builds a filtered queryset/count, but it always calls run_linking_agent() with no filter, so it will link (and potentially bill LLM calls for) entities of all types. Consider adding an entity_type filter parameter to run_linking_agent (and the task entry point), or implement a separate entry point that links only the IDs selected by the command.

Suggested change
- run_linking_agent()
+ run_linking_agent(entity_type_key=entity_type_key)

Owner Author

Fixed in 1e9c1c6. Added entity_type_key parameter to run_linking_agent() — the management command now passes the --type filter through, and follow-up async tasks preserve the filter.
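
As a rough illustration of threading a type filter through batch selection, here is a minimal sketch that uses an in-memory list in place of the `Entity` queryset. `EntityRow` and `select_pending` are hypothetical names introduced for this example; only `entity_type_key`, the `pending` status, and the ordering by entity type key and name come from the thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityRow:
    """Hypothetical stand-in for an Entity row (not the project's model)."""

    name: str
    entity_type_key: str
    linking_status: str = "pending"


def select_pending(
    rows: List[EntityRow],
    batch_size: int,
    entity_type_key: Optional[str] = None,
) -> List[EntityRow]:
    """Return up to batch_size pending rows, optionally limited to one type."""
    pending = [
        row
        for row in rows
        if row.linking_status == "pending"
        and (entity_type_key is None or row.entity_type_key == entity_type_key)
    ]
    # Mirrors the order_by("entity_type__key", "name") seen in the reviewed code.
    pending.sort(key=lambda row: (row.entity_type_key, row.name))
    return pending[:batch_size]
```

Passing `entity_type_key=None` keeps the unfiltered behavior, so callers that do not care about the type are unaffected.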


run_linking_agent()

self.stdout.write(self.style.SUCCESS("Linking complete."))

Copilot AI Mar 31, 2026

The command prints “Linking complete.” even though run_linking_agent() only processes a single batch and may enqueue additional async runs for remaining entities. This can mislead operators into thinking the backlog is fully drained. Consider looping until no pending entities remain when invoked from CLI, or change the messaging to indicate that additional batches were queued (and possibly print remaining count).

Suggested change
- self.stdout.write(self.style.SUCCESS("Linking complete."))
+ # Re-check pending entities to avoid misleading operators about backlog state.
+ remaining_pending = Entity.objects.filter(
+     linking_status=Entity.LinkingStatus.PENDING
+ )
+ if entity_type_key:
+     remaining_pending = remaining_pending.filter(
+         entity_type__key=entity_type_key
+     )
+ remaining_count = remaining_pending.count()
+ if remaining_count == 0:
+     self.stdout.write(
+         self.style.SUCCESS("Linking complete. No pending entities remain.")
+     )
+ else:
+     self.stdout.write(
+         self.style.SUCCESS(
+             f"Linking triggered, but {remaining_count} entities remain "
+             "pending (additional batches may be processed asynchronously)."
+         )
+     )

Owner Author

Fixed in 1e9c1c6. The command now checks remaining pending count after the batch and shows either "Linking complete." or "Batch complete — N entities still pending (additional batches queued asynchronously)."
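
The review also mentioned looping until the backlog is drained as an alternative to reporting the remaining count. A minimal in-memory sketch of that alternative follows; it is not the project's command. `run_one_batch`, `drain_pending`, the dict-based store, and the `max_batches` safety cap are all hypothetical stand-ins for `run_linking_agent()` and the `Entity` table.

```python
from typing import Callable, Dict, Tuple

PENDING, LINKED = "pending", "linked"


def run_one_batch(store: Dict[int, str], batch_size: int) -> int:
    """Hypothetical single agent run: link up to batch_size pending ids."""
    batch = sorted(i for i, status in store.items() if status == PENDING)[:batch_size]
    for entity_id in batch:
        store[entity_id] = LINKED
    return len(batch)


def drain_pending(
    store: Dict[int, str],
    batch_size: int,
    run_batch: Callable[[Dict[int, str], int], int] = run_one_batch,
    max_batches: int = 100,
) -> Tuple[int, int]:
    """Run batches until none make progress; return (batches_run, remaining)."""
    batches = 0
    while batches < max_batches:
        if run_batch(store, batch_size) == 0:
            break  # nothing was processed, so stop instead of spinning
        batches += 1
    remaining = sum(1 for status in store.values() if status == PENDING)
    return batches, remaining
```

The cap matters because a batch that keeps failing would otherwise loop forever; the returned `remaining` count can feed the same kind of operator message the command now prints.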

Comment on lines +154 to +158
pending = list(
    Entity.objects.filter(linking_status=Entity.LinkingStatus.PENDING)
    .select_related("entity_type")
    .order_by("entity_type__key", "name")[:batch_size]
)

Copilot AI Mar 31, 2026

run_linking_agent() selects a batch of pending entities without any locking/state transition, so multiple concurrent tasks (e.g., multiple resolve completions or admin retries) can pick up and process the same entities in parallel. That can lead to duplicated LLM spend and last-writer-wins status updates. Consider atomically claiming work (e.g., add an in_progress linking_status and update(..., linking_status='in_progress') on a selected ID set, or use select_for_update(skip_locked=True) in a transaction and transition status before calling the agent).

Owner Author

Fixed in 1e9c1c6. Added a LINKING status to LinkingStatus. The agent now atomically claims a batch by updating PENDING → LINKING before processing. If the agent fails, claimed entities are reset back to PENDING for retry. Concurrent tasks that find no unclaimed entities exit early.
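
The claim-then-process idea can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the code in 1e9c1c6: the `entity` table layout and the `claim_batch` and `release_batch` helpers are hypothetical. The key point is the conditional `UPDATE ... WHERE linking_status = 'pending'`, which acts as a compare-and-set so a row can only be claimed once.

```python
import sqlite3
from typing import Iterable, List


def claim_batch(conn: sqlite3.Connection, batch_size: int) -> List[int]:
    """Move up to batch_size rows from 'pending' to 'linking'; return claimed ids."""
    claimed: List[int] = []
    with conn:  # commit on success, roll back if anything raises
        candidate_ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM entity WHERE linking_status = 'pending' "
                "ORDER BY name LIMIT ?",
                (batch_size,),
            )
        ]
        for entity_id in candidate_ids:
            cur = conn.execute(
                "UPDATE entity SET linking_status = 'linking' "
                "WHERE id = ? AND linking_status = 'pending'",
                (entity_id,),
            )
            # rowcount is 0 if another worker claimed this row first.
            if cur.rowcount == 1:
                claimed.append(entity_id)
    return claimed


def release_batch(conn: sqlite3.Connection, ids: Iterable[int]) -> None:
    """Reset claimed rows back to 'pending', e.g. after the agent run fails."""
    with conn:
        conn.executemany(
            "UPDATE entity SET linking_status = 'pending' "
            "WHERE id = ? AND linking_status = 'linking'",
            [(entity_id,) for entity_id in ids],
        )
```

A worker that gets an empty list back from `claim_batch` has nothing to do and can exit early, which matches the behavior described in the reply above.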

Comment on lines +532 to +538
count = queryset.exclude(
    linking_status=Entity.LinkingStatus.LINKED,
).update(linking_status=Entity.LinkingStatus.PENDING)
self.message_user(request, f"Reset {count} entities to pending for re-linking.")
if count > 0:
    async_task("episodes.agents.linker.run_linking_agent")


Copilot AI Mar 31, 2026

The admin action resets selected entities to pending but then queues run_linking_agent(), which processes the global pending backlog, not necessarily (or only) the selected entities. If the intent is “retry these selected entities”, consider passing selected IDs to the task/agent (or providing a per-entity retry task) to avoid unexpectedly triggering a large background linking run.

Suggested change
- count = queryset.exclude(
-     linking_status=Entity.LinkingStatus.LINKED,
- ).update(linking_status=Entity.LinkingStatus.PENDING)
- self.message_user(request, f"Reset {count} entities to pending for re-linking.")
- if count > 0:
-     async_task("episodes.agents.linker.run_linking_agent")
+ # Restrict to selected entities that are not already linked
+ entity_ids = list(
+     queryset.exclude(
+         linking_status=Entity.LinkingStatus.LINKED,
+     ).values_list("id", flat=True)
+ )
+ if not entity_ids:
+     self.message_user(request, "No entities were reset to pending for re-linking.")
+     return
+ # Reset linking status only for the selected entities that need retrying
+ Entity.objects.filter(id__in=entity_ids).update(
+     linking_status=Entity.LinkingStatus.PENDING
+ )
+ self.message_user(
+     request,
+     f"Reset {len(entity_ids)} entities to pending for re-linking.",
+ )
+ # Queue background linking specifically for the selected entities
+ async_task(
+     "episodes.agents.linker.run_linking_agent",
+     entity_ids=entity_ids,
+ )

Owner Author

Fixed in 1e9c1c6. Improved the admin action messaging: it now shows "No entities need re-linking" when nothing was reset, or "Linking agent queued to process all pending entities." when entities were reset. The global-backlog behavior is intentional for the admin action (catch-up semantics).

Comment on lines +17 to +25
try:
    from .agents.linker import handle_resolve_completed

    step_completed.connect(
        handle_resolve_completed,
        dispatch_uid="linking_resolve_completed",
    )
except Exception:
    pass

Copilot AI Mar 31, 2026

The blanket except Exception: pass will silently disable linking-agent wiring if anything goes wrong (import error, syntax error, missing dependency, etc.), making the feature fail without any visibility. At minimum, log the exception (or catch only ImportError) so misconfiguration is diagnosable.

Owner Author

Fixed in 1e9c1c6. Narrowed to except ImportError: pass to match the recovery agent pattern.
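
The merged fix keeps `pass` in the handler. For comparison, the other option raised in the review (catch only `ImportError`, but log it) might look like the sketch below. `connect_optional_handler` is a hypothetical helper, and the `connect` callable stands in for something like `step_completed.connect`; neither is taken from the repository.

```python
import importlib
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def connect_optional_handler(
    module_name: str,
    handler_name: str,
    connect: Callable[[Any], Any],
) -> bool:
    """Import module_name.handler_name and hand it to connect().

    Returns False (and logs a warning with the traceback) if the module
    cannot be imported. Any other error propagates so it stays visible.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "Optional handler %s.%s not wired", module_name, handler_name,
            exc_info=True,
        )
        return False
    connect(getattr(module, handler_name))
    return True
```

With this shape a missing optional dependency degrades gracefully, while a typo in the handler name raises `AttributeError` at startup instead of silently disabling the feature.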


**Date:** 2026-03-31

**Session ID:** unavailable

Copilot AI Mar 31, 2026

This session transcript sets **Session ID:** unavailable, but AGENTS.md requires the real Claude Code session UUID and allows unavailable only when it cannot be recovered from logs. Please replace with the actual session UUID (or document why it’s unrecoverable).

Suggested change
**Session ID:** unavailable
**Session ID:** unavailable — session UUID could not be recovered from Claude Code logs (see AGENTS.md policy).

Owner Author

Fixed in 1e9c1c6. Updated to: unavailable — session UUID could not be recovered from Claude Code logs.


**Date:** 2026-03-31

**Session ID:** unavailable

Copilot AI Mar 31, 2026

This session transcript sets **Session ID:** unavailable, but AGENTS.md requires the real Claude Code session UUID and allows unavailable only when it cannot be recovered from logs. Please replace with the actual session UUID (or document why it’s unrecoverable).

Suggested change
**Session ID:** unavailable
**Session ID:** unavailable — session UUID could not be recovered from logs (log data no longer available)

Owner Author

Fixed in 1e9c1c6. Same — added explanation for why the UUID is unavailable.

Comment on lines +51 to +56
model_str = getattr(settings, "RAGTIME_LINKING_AGENT_MODEL", "openai:gpt-4.1-mini")
api_key = getattr(settings, "RAGTIME_LINKING_AGENT_API_KEY", "")

if not api_key:
    return model_str


Copilot AI Mar 31, 2026

_build_model() reads linking-agent configuration via getattr(settings, 'RAGTIME_LINKING_AGENT_*', ...), but ragtime/settings.py currently does not define any RAGTIME_LINKING_AGENT_* settings (so values set in .env / configure won’t be picked up). As-is, the agent will always use the defaults and api_key will stay empty unless the provider reads a different env var (e.g., OPENAI_API_KEY). Please add the corresponding settings definitions in ragtime/settings.py (similar to the recovery agent) so RAGTIME_LINKING_AGENT_ENABLED/API_KEY/MODEL/BATCH_SIZE are honored.

Owner Author

Fixed in 1e9c1c6. Added RAGTIME_LINKING_AGENT_ENABLED, RAGTIME_LINKING_AGENT_API_KEY, RAGTIME_LINKING_AGENT_MODEL, and RAGTIME_LINKING_AGENT_BATCH_SIZE to ragtime/settings.py, following the same pattern as the recovery agent settings. The .env values are now properly picked up.
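
For readers unfamiliar with the pattern, settings like these are typically derived from environment variables with defaults. The sketch below is an assumption about the shape, not the contents of `ragtime/settings.py`: the helper names are hypothetical, and the batch size default of 50 is invented for illustration. The default model string and the empty API key come from the `_build_model()` snippet above, and "enabled by default" comes from the commit message.

```python
import os
from typing import Any, Dict, Mapping


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean environment value; unset means default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_linking_agent_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build the RAGTIME_LINKING_AGENT_* values from an environment mapping."""
    return {
        "RAGTIME_LINKING_AGENT_ENABLED": _env_bool(
            env, "RAGTIME_LINKING_AGENT_ENABLED", True
        ),
        "RAGTIME_LINKING_AGENT_API_KEY": env.get("RAGTIME_LINKING_AGENT_API_KEY", ""),
        "RAGTIME_LINKING_AGENT_MODEL": env.get(
            "RAGTIME_LINKING_AGENT_MODEL", "openai:gpt-4.1-mini"
        ),
        # The default of 50 is an assumption; the thread does not state it.
        "RAGTIME_LINKING_AGENT_BATCH_SIZE": int(
            env.get("RAGTIME_LINKING_AGENT_BATCH_SIZE", "50")
        ),
    }
```

Once values like these exist as module-level settings, the `getattr(settings, ...)` calls in `_build_model()` return the configured values rather than always falling back to their inline defaults.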

…protection

- Remove wikidata_id entirely from resolver schema/prompt — linking
  agent is the single owner of Q-ID assignment (comment #1)
- Pass entity_type_key filter to run_linking_agent from management
  command (comment #2)
- Show remaining pending count after batch completes (comment #3)
- Add LINKING status for atomic work claiming to prevent concurrent
  tasks from processing the same entities (comment #4)
- Clarify admin retry action messaging (comment #5)
- Narrow except to ImportError in apps.py (comment #6)
- Add explanation for unavailable session IDs (comments #7, #8)
- Add RAGTIME_LINKING_AGENT_* settings to settings.py so .env values
  are picked up (comment #9)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 27 changed files in this pull request and generated 5 comments.


Comment on lines +57 to +60
entity.wikidata_id = wikidata_qid
entity.linking_status = Entity.LinkingStatus.LINKED
await entity.asave(update_fields=["wikidata_id", "linking_status", "updated_at"])


Copilot AI Mar 31, 2026

link_entity persists wikidata_qid directly into Entity.wikidata_id without any validation/sanitization. Since this value is model/tool-provided, it can easily be a full URL or include extra text, which can exceed max_length=20 and raise a DB error (and/or store invalid IDs). Consider extracting/validating a strict Q\d+ value (or rejecting invalid input and marking the entity failed) before saving.
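
One possible shape for the strict check described here, as a hedged sketch: accept a bare Q-ID or a Wikidata page/entity URL, and reject everything else. `extract_qid` is a hypothetical helper, not the sanitizer in `wikidata.py`; the `max_length=20` default mirrors the field length mentioned in the comment.

```python
import re
from typing import Optional

# Bare Q-ID such as "Q42"; a leading zero after Q is rejected.
_BARE_QID = re.compile(r"Q[1-9]\d*")
# Wikidata page or entity URL that ends in a Q-ID.
_QID_URL = re.compile(r"https?://www\.wikidata\.org/(?:wiki|entity)/(Q[1-9]\d*)")


def extract_qid(raw: object, max_length: int = 20) -> Optional[str]:
    """Return a validated Q-ID, or None if raw is not exactly one Q-ID."""
    if not isinstance(raw, str):
        return None
    text = raw.strip()
    url_match = _QID_URL.fullmatch(text)
    if url_match:
        text = url_match.group(1)
    if _BARE_QID.fullmatch(text) and len(text) <= max_length:
        return text
    return None
```

A caller could treat a `None` result as a reason to mark the entity `failed` rather than saving the raw value, which is the alternative the comment suggests.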

Comment on lines +212 to +216
logger.info(
"Linking agent completed: %d linked, %d failed, %d skipped — %s",
result.linked, result.failed, result.skipped, result.message,
)
except Exception:

Copilot AI Mar 31, 2026

After a successful agent run, there’s no cleanup for entities that remain in LINKING state (e.g., if the model fails to call any tool for some IDs, or a tool call is skipped). Those rows will be stuck and never retried because subsequent runs only pick up PENDING. Consider, after agent.run completes, updating any batch_ids still in LINKING back to PENDING (or FAILED) so they can be retried/inspected.
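
The cleanup suggested here can be sketched with a plain dict standing in for the `Entity` table. `release_unfinished` and `run_with_cleanup` are hypothetical names, and resetting to `pending` (rather than `failed`) is just one of the two options the comment mentions.

```python
from typing import Callable, Dict, Iterable, List


def release_unfinished(
    statuses: Dict[int, str],
    batch_ids: Iterable[int],
    fallback: str = "pending",
) -> List[int]:
    """Reset any id in batch_ids still marked 'linking'; return the ids reset."""
    stuck = [i for i in batch_ids if statuses.get(i) == "linking"]
    for entity_id in stuck:
        statuses[entity_id] = fallback
    return stuck


def run_with_cleanup(
    statuses: Dict[int, str],
    batch_ids: List[int],
    agent: Callable[[Dict[int, str], List[int]], None],
) -> List[int]:
    """Run the agent on a claimed batch, then release whatever it left behind."""
    try:
        agent(statuses, batch_ids)
    finally:
        # Runs on success and on failure, so no row stays in 'linking'.
        released = release_unfinished(statuses, batch_ids)
    return released
```

Because the release step sits in `finally`, it covers both the failure path and the quieter case where the run succeeds but the model never calls a tool for some IDs.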

Comment on lines +155 to +159
| `pending` | Not yet processed by the linking agent |
| `linked` | Successfully linked to a Wikidata Q-ID |
| `skipped` | Entity type has no Wikidata class Q-ID |
| `failed` | No suitable Wikidata match found |


Copilot AI Mar 31, 2026

The linking agent documentation omits the `linking` status even though it’s a real Entity.linking_status choice and will show up in admin/filters. Consider documenting what `linking` means (claimed/in-progress) and how to recover if entities get stuck in that state (e.g., reset to `pending` via admin/CLI).

Suggested change
- | `pending` | Not yet processed by the linking agent |
- | `linked` | Successfully linked to a Wikidata Q-ID |
- | `skipped` | Entity type has no Wikidata class Q-ID |
- | `failed` | No suitable Wikidata match found |
+ | `pending` | Not yet processed by the linking agent |
+ | `linking` | Claimed by a worker and currently being processed by the linking agent (normally a short-lived, in-progress state) |
+ | `linked` | Successfully linked to a Wikidata Q-ID |
+ | `skipped` | Entity type has no Wikidata class Q-ID |
+ | `failed` | No suitable Wikidata match found |
+
+ The `linking` state is transient and usually clears quickly. If you see entities stuck in `linking` for an extended period (for example after a worker crash or deployment), you can safely reset their `linking_status` back to `pending` via the Django admin or a management command/CLI script; the linking agent will pick them up again on the next run.

### Entity model (`episodes/models.py`)

Added `LinkingStatus` choices (`pending`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. Data migration sets existing entities with `wikidata_id` to `linked`.

Copilot AI Mar 31, 2026

This feature doc lists LinkingStatus choices as (pending, linked, skipped, failed), but the model also includes `linking` (in-progress) and there’s a migration altering the field choices accordingly. Please update the doc to include `linking` (or explain if it’s intentionally internal-only).

Suggested change
- Added `LinkingStatus` choices (`pending`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. Data migration sets existing entities with `wikidata_id` to `linked`.
+ Added `LinkingStatus` choices (`pending`, `linking`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. `linking` represents an in-progress Wikidata linking operation. Data migration sets existing entities with `wikidata_id` to `linked`.
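
To make the five states concrete, here is a small sketch using the standard-library `enum` module rather than the project's Django model. The status values come from the thread; the transition table is only one reading of the behavior discussed there (claim, finish, reset on failure, admin retry of non-linked entities) and should be treated as an assumption, as should the `transition` helper.

```python
from enum import Enum


class LinkingStatus(str, Enum):
    PENDING = "pending"
    LINKING = "linking"
    LINKED = "linked"
    SKIPPED = "skipped"
    FAILED = "failed"


# Assumed transitions, inferred from the review discussion.
ALLOWED_TRANSITIONS = {
    LinkingStatus.PENDING: {LinkingStatus.LINKING},  # a worker claims the row
    LinkingStatus.LINKING: {
        LinkingStatus.LINKED,
        LinkingStatus.SKIPPED,
        LinkingStatus.FAILED,
        LinkingStatus.PENDING,  # reset after a failed or incomplete run
    },
    LinkingStatus.SKIPPED: {LinkingStatus.PENDING},  # admin retry
    LinkingStatus.FAILED: {LinkingStatus.PENDING},  # admin retry
    LinkingStatus.LINKED: set(),  # the admin action excludes linked rows
}


def transition(current: LinkingStatus, new: LinkingStatus) -> LinkingStatus:
    """Return new if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Writing the table down this way makes it easy to see why `linking` needs documenting: it is the only state that can move in four directions.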

Comment on lines +38 to +41
**Step 9 — Tests**: Removed all 14 `@patch("episodes.resolver._fetch_wikidata_candidates")` decorators from `test_resolve.py`. Reworked 4 tests that specifically tested Wikidata integration: `test_wikidata_candidates_used_in_resolution` → `test_llm_returned_wikidata_id_saved_on_existing`, removed `test_wikidata_new_entities_with_candidates`, replaced `test_llm_omitted_name_fallback_new_entities` → `test_new_entities_created_directly_without_llm`, updated `test_noisy_wikidata_id_is_sanitized` to use existing entities for LLM path. Created `test_linker.py` with 9 tests covering linking status model, agent lifecycle (disabled, no pending, auto-skip), and signal handler (trigger on resolve, ignore other steps, no trigger when no pending, no trigger when disabled).

Verified: all imports compile, Django system checks pass (`manage.py check` — 0 issues). PostgreSQL not running locally so test suite could not be executed, but all files compile and import correctly.


Copilot AI Mar 31, 2026

This session transcript states the test suite could not be executed due to PostgreSQL not running locally, but the repo now documents a policy to run the full test suite before creating a PR (AGENTS.md). Please reconcile this (e.g., update the transcript to reflect where/when tests were actually run, or adjust the policy language if CI coverage is the intended gate).
