10 changes: 10 additions & 0 deletions .env.sample
@@ -82,6 +82,16 @@ RAGTIME_RECOVERY_AGENT_MODEL=
# Timeout in seconds for recovery agent attempts (default: 120)
RAGTIME_RECOVERY_AGENT_TIMEOUT=

# Linking agent — async Wikidata entity linking after pipeline resolve step
# Enable the linking agent (true/false, default: true)
RAGTIME_LINKING_AGENT_ENABLED=
# API key for the linking agent LLM provider
RAGTIME_LINKING_AGENT_API_KEY=
# Pydantic AI model string (default: openai:gpt-4.1-mini)
RAGTIME_LINKING_AGENT_MODEL=
# Batch size for linking agent (default: 50)
RAGTIME_LINKING_AGENT_BATCH_SIZE=

# Vector store backend (chroma, etc.)
RAGTIME_VECTOR_STORE=
# ChromaDB server host (default: localhost, omit for embedded/local mode)
8 changes: 8 additions & 0 deletions AGENTS.md
@@ -98,6 +98,14 @@ The commit for a given feature MUST contain the plan, the feature documentation,

## PR Creation

Before creating a PR, run the full test suite and verify it passes:

```bash
uv run python manage.py test --verbosity 2
```

Do not create the PR if tests are failing. Fix the failures first.

When creating PRs, ensure the PR includes: plan document, feature doc, session transcripts (planning + implementation), and changelog entry. Review the Documentation section above for full requirements before creating the PR.

## GitHub API (`gh`)
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## 2026-03-31

### Added

- Background linking agent — decouple Wikidata entity linking from the pipeline resolve step into an asynchronous Pydantic AI agent. The resolve step now performs pure LLM-based entity deduplication without external API calls, eliminating Wikidata timeouts. A linking agent runs in the background after resolve completes, enriching entities with Wikidata Q-IDs using LLM-based candidate disambiguation. Adds `linking_status` field to Entity model, `link_entities` management command, admin retry action, and `RAGTIME_LINKING_AGENT_*` configuration — [plan](doc/plans/2026-03-31-linking-agent.md), [feature](doc/features/2026-03-31-linking-agent.md), [planning session](doc/sessions/2026-03-31-linking-agent-planning-session.md), [implementation session](doc/sessions/2026-03-31-linking-agent-implementation-session.md)

## 2026-03-23

### Fixed
6 changes: 4 additions & 2 deletions README.md
@@ -25,7 +25,7 @@ RAGtime is a Django application for ingesting jazz-related podcast episodes. It

- 🎙️ **Episode Ingestion** — Add podcast episodes by URL. RAGtime scrapes metadata (title, description, date, image), downloads audio, and processes it through the pipeline.
- 📝 **Multilingual Transcription** — Transcribes episodes using configurable backends (Whisper API by default) with segment and word-level timestamps. Supports multiple languages (English, Spanish, German, Swedish, etc.).
- 🔍 **Entity Extraction** — Identifies jazz entities: musicians, musical groups, albums, music venues, recording sessions, record labels, years. Entities are resolved against existing records using LLM-based matching.
- 🔍 **Entity Extraction** — Identifies jazz entities: musicians, musical groups, albums, music venues, recording sessions, record labels, years. Entities are resolved against existing records using LLM-based matching. A background linking agent asynchronously enriches entities with Wikidata Q-IDs without blocking the pipeline.
- 📇 **Episode Indexing** — Splits transcripts into segments and generates multilingual embeddings stored in ChromaDB. Enables cross-language semantic search so Scott can retrieve relevant content regardless of the question's language.
- 🎷 **Scott — Your Jazz AI** — A conversational agent that answers questions strictly from ingested episode content. Scott responds in the user's language and provides references to specific episodes and timestamps. Responses stream in real-time.
- 📊 **AI Evaluation** — Measures pipeline and Scott quality using [RAGAS](https://docs.ragas.io/) (faithfulness, answer relevancy, context precision/recall) with scores tracked in [Langfuse](https://langfuse.com/docs/scores/model-based-evals/ragas).
@@ -65,12 +65,14 @@ Each step updates the episode's `status` field. A `post_save` signal dispatches
| 5 | 📋 Summarize | `summarizing` | LLM-generated episode summary |
| 6 | ✂️ Chunk | `chunking` | Split transcript into ~150-word chunks |
| 7 | 🔍 Extract | `extracting` | Named entity recognition per chunk |
| 8 | 🧩 Resolve | `resolving` | Entity linking and deduplication via Wikidata |
| 8 | 🧩 Resolve | `resolving` | LLM-based entity deduplication against existing DB records |
| 9 | 📐 Embed | `embedding` | Multilingual embeddings into ChromaDB |
| 10 | ✅ Ready | `ready` | Episode available for Scott to query |

_Steps 9–10 (Embed, Ready) are planned and not yet implemented._

After the resolve step completes, a **linking agent** runs asynchronously to enrich entities with [Wikidata](https://www.wikidata.org/) Q-IDs. This is not a pipeline step — it never blocks episode processing. See the [linking agent documentation](doc/README.md#linking-agent) for details.

See the [full pipeline documentation](doc/README.md) for per-step details, entity types, and the recovery layer.

## Documentation
17 changes: 17 additions & 0 deletions core/management/commands/_configure_helpers.py
@@ -124,6 +124,23 @@
},
],
},
{
"name": "Linking Agent",
"description": "Async Wikidata entity linking after pipeline resolve step",
"shareable": False,
"subsystems": [
{
"prefix": "RAGTIME_LINKING",
"label": "Linking Agent",
"fields": [
("AGENT_ENABLED", "true", False),
("AGENT_API_KEY", "", True),
("AGENT_MODEL", "openai:gpt-4.1-mini", False),
("AGENT_BATCH_SIZE", "50", False),
],
},
],
},
{
"name": "LLM Observability",
"description": "Langfuse tracing for LLM calls (optional)",
8 changes: 8 additions & 0 deletions core/tests/test_configure.py
@@ -230,6 +230,7 @@ def test_shared_mode_wizard(self, mock_input, mock_getpass):
"sk-newkey123", # Shared LLM API key
"sk-newkey123", # Transcription API key
"", # Recovery agent API key (keep default)
"", # Linking agent API key (keep default)
"", # Langfuse secret key (keep default)
"", # Langfuse public key (keep default)
]
@@ -255,6 +256,9 @@ def test_shared_mode_wizard(self, mock_input, mock_getpass):
"", # Recovery agent enabled (keep default)
"", # Recovery agent model (keep default)
"", # Recovery agent timeout (keep default)
"", # Linking agent enabled (keep default)
"", # Linking agent model (keep default)
"", # Linking agent batch size (keep default)
"", # Langfuse enabled (keep default)
"", # Langfuse host (keep default)
]
@@ -335,6 +339,7 @@ def test_rerun_preserves_non_ragtime_lines(self, mock_input, mock_getpass):
"sk-newkey123", # Shared LLM API key
"sk-newkey123", # Transcription API key
"", # Recovery agent API key (keep default)
"", # Linking agent API key (keep default)
"", # Langfuse secret key (keep default)
"", # Langfuse public key (keep default)
]
@@ -360,6 +365,9 @@ def test_rerun_preserves_non_ragtime_lines(self, mock_input, mock_getpass):
"", # Recovery agent enabled (keep default)
"", # Recovery agent model (keep default)
"", # Recovery agent timeout (keep default)
"", # Linking agent enabled (keep default)
"", # Linking agent model (keep default)
"", # Linking agent batch size (keep default)
"", # Langfuse enabled (keep default)
"", # Langfuse host (keep default)
]
66 changes: 50 additions & 16 deletions doc/README.md
@@ -71,30 +71,24 @@ New types can be added through Django admin; existing types can be deactivated (

#### 8. 🧩 Resolve entities (status: `resolving`)

**Entity Linking (NEL)** — maps extracted mentions to canonical entity records, deduplicating across chunks.
**Entity Resolution** — maps extracted mentions to canonical entity records, deduplicating across chunks.

Aggregates all extracted names across every chunk, then resolves **once per entity type** using LLM-based fuzzy matching against two sources:
Aggregates all extracted names across every chunk, then resolves **once per entity type** using LLM-based fuzzy matching against **existing DB records** — preventing duplicates when the same entity was seen in a previous episode.

1. **Existing DB records** — prevents duplicates when the same entity was seen in a previous episode.
2. **[Wikidata](https://www.wikidata.org/) candidates** — searches by name and type, presenting candidates (with Q-IDs and descriptions) to the LLM for confirmation. Matched entities receive a `wikidata_id` for canonical identification.
When no existing entities of a given type exist in the database, all extracted names are created as new `Entity` records directly (no LLM call needed — there is nothing to deduplicate against). When existing entities are present, the LLM resolves extracted names against them, considering spelling variants, language differences, and alternate names.

**Example** — continuing from the extract step, suppose the episode's chunks collectively mention "Bird", "Charlie Parker", "Yardbird", and "Dizzy Gillespie":

| Extracted mentions | Resolved to (canonical entity) | Wikidata ID |
|---|---|---|
| Bird, Charlie Parker, Yardbird | Charlie Parker | [Q103767](https://www.wikidata.org/wiki/Q103767) |
| Dizzy Gillespie | Dizzy Gillespie | [Q49575](https://www.wikidata.org/wiki/Q49575) |
| Extracted mentions | Resolved to (canonical entity) |
|---|---|
| Bird, Charlie Parker, Yardbird | Charlie Parker |
| Dizzy Gillespie | Dizzy Gillespie |

All three surface forms collapse into a single `Entity` record for Charlie Parker. An `EntityMention` is created for each (entity, chunk) pair, preserving which chunks mentioned the entity and the context of each mention.
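The collapse described above can be pictured as a small sketch (illustrative only — `collapse_mentions` and its inputs are hypothetical stand-ins, not the project's resolver code; the LLM's output is modeled as a mapping from each extracted surface form to its canonical name):

```python
# Illustrative sketch, not the actual resolver: collapse surface forms into
# canonical entities while keeping one mention per (chunk, entity) pair.
def collapse_mentions(llm_mapping, chunk_mentions):
    """llm_mapping: {surface form -> canonical name}, as resolved by the LLM.
    chunk_mentions: [(chunk_id, surface_form), ...] from the extract step.
    Returns (canonical_names, mentions)."""
    canonical_names = sorted(set(llm_mapping.values()))
    mentions = [(chunk_id, llm_mapping[surface])
                for chunk_id, surface in chunk_mentions]
    return canonical_names, mentions
```

With the mapping from the example table, "Bird", "Charlie Parker", and "Yardbird" all fold into one canonical entity while each chunk keeps its own mention row.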

This two-phase design (extract then resolve) is intentional: extraction is cheap and parallelizable per chunk, while resolution requires cross-chunk aggregation and knowledge base lookups. It also allows re-running resolution independently — e.g., after improving matching logic — without re-extracting.

Search Wikidata from the CLI with:
This two-phase design (extract then resolve) is intentional: extraction is cheap and parallelizable per chunk, while resolution requires cross-chunk aggregation. It also allows re-running resolution independently — e.g., after improving matching logic — without re-extracting.

```
uv run python manage.py lookup_entity "Miles Davis"
uv run python manage.py lookup_entity --type musician "Miles Davis"
```
Wikidata Q-IDs are assigned asynchronously by the [linking agent](#linking-agent) after the resolve step completes — see below.

#### 9. 📐 Embed (status: `embedding`) — *planned, not yet implemented*

@@ -144,6 +138,46 @@ The agent runs as a single [`agent.run()`](https://ai.pydantic.dev/agents/#runni

The chain order is configured in [`settings.py`](../ragtime/settings.py), and the maximum retry count (default: 5) is controlled by the `MAX_RECOVERY_ATTEMPTS` constant in [`episodes/recovery.py`](../episodes/recovery.py). The system prompt and tool registration are in [`episodes/agents/agent.py`](../episodes/agents/agent.py). The agent tools — `navigate_to_url`, `find_audio_links`, `click_element`, `download_file`, `translate_text`, `analyze_screenshot`, `click_at_coordinates`, `intercept_audio_requests`, and others — are defined in [`episodes/agents/tools.py`](../episodes/agents/tools.py).

### Linking Agent

After the resolve step completes, a **linking agent** runs asynchronously to enrich entities with [Wikidata](https://www.wikidata.org/) Q-IDs. This is **not** a pipeline step — it never blocks episode processing. The pipeline continues to the embed step immediately while the linking agent works in the background.

The linking agent is a [Pydantic AI](https://ai.pydantic.dev/) agent that:
1. Picks up entities with `linking_status = "pending"`
2. Searches Wikidata for candidates matching each entity's name and type
3. Uses LLM reasoning to disambiguate candidates (e.g., "Blue Note" — jazz club vs. record label)
4. Links entities to Q-IDs or marks them as failed/skipped
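The control flow in steps 1–4 can be sketched as follows (a hedged sketch, not the agent's actual code — `class_qid_for`, `search`, and `disambiguate` are hypothetical stand-ins for the real tools, injected so the flow itself is testable):

```python
# Hypothetical sketch of the linking decision flow described above.
def link_batch(entities, class_qid_for, search, disambiguate):
    """Return {entity name: (linking_status, qid_or_None)} for one batch."""
    outcomes = {}
    for entity in entities:
        if class_qid_for(entity["type"]) is None:
            # Entity type has no Wikidata class Q-ID -> nothing to search for.
            outcomes[entity["name"]] = ("skipped", None)
            continue
        candidates = search(entity["name"], entity["type"])
        qid = disambiguate(entity, candidates)  # LLM picks a candidate or None
        outcomes[entity["name"]] = ("linked", qid) if qid else ("failed", None)
    return outcomes
```

The injected callables mirror the agent's tools: `search` corresponds to `search_wikidata`, and the three outcomes map onto `link_entity`, `mark_failed`, and `skip_entity`.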

Each `Entity` record tracks its linking state via the `linking_status` field:

| Status | Meaning |
|--------|---------|
| `pending` | Not yet processed by the linking agent |
| `linked` | Successfully linked to a Wikidata Q-ID |
| `skipped` | Entity type has no Wikidata class Q-ID |
| `failed` | No suitable Wikidata match found |

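The lifecycle implied by this table can be summarized in a small sketch (an assumption drawn from this doc, not code from the project: `linked` is treated as terminal, while `failed` and `skipped` can be reset to `pending` by the retry action):

```python
# Sketch of the linking_status transitions implied by the table above.
ALLOWED_TRANSITIONS = {
    "pending": {"linked", "skipped", "failed"},  # outcomes of an agent run
    "failed": {"pending"},    # retry via admin action or `link_entities --retry`
    "skipped": {"pending"},   # same retry path as failed
    "linked": set(),          # terminal once a Q-ID is assigned
}

def can_transition(src, dst):
    return dst in ALLOWED_TRANSITIONS.get(src, set())
```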
> **Copilot AI — comment on lines +155 to +159 (Mar 31, 2026):**
> The linking agent documentation omits the `linking` status even though it's a real `Entity.linking_status` choice and will show up in admin/filters. Consider documenting what `linking` means (claimed/in-progress) and how to recover if entities get stuck in that state (e.g., reset to `pending` via admin/CLI).
>
> Suggested change — add a `linking` row to the table:
>
> | `linking` | Claimed by a worker and currently being processed by the linking agent (normally a short-lived, in-progress state) |
>
> …and note that the `linking` state is transient and usually clears quickly. If you see entities stuck in `linking` for an extended period (for example after a worker crash or deployment), you can safely reset their `linking_status` back to `pending` via the Django admin or a management command/CLI script; the linking agent will pick them up again on the next run.

The agent is triggered by the `step_completed` signal when the resolve step finishes. It processes all pending entities (not just those from the current episode), working in configurable batch sizes.
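The trigger wiring can be pictured as a dispatcher that ignores every step except resolve (a minimal stand-alone sketch — the dispatcher, handler name, and `"resolving"` literal are illustrative; the real wiring uses Django signals in `episodes/agents/linker.py`):

```python
# Minimal dispatcher sketch of the step_completed trigger described above.
_handlers = []

def connect(handler):
    _handlers.append(handler)

def step_completed(step):
    for handler in _handlers:
        handler(step)

queued = []

def maybe_start_linking(step):
    if step == "resolving":      # only react when the resolve step finishes
        queued.append("link pending entities")

connect(maybe_start_linking)
```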

The linking agent is **on by default**. Configure via the wizard or set these variables in `.env`:
```
RAGTIME_LINKING_AGENT_ENABLED=true
RAGTIME_LINKING_AGENT_API_KEY=sk-your-key
RAGTIME_LINKING_AGENT_MODEL=openai:gpt-4.1-mini
RAGTIME_LINKING_AGENT_BATCH_SIZE=50
```

Link entities manually from the CLI:
```bash
uv run python manage.py link_entities # Link all pending
uv run python manage.py link_entities --retry # Reset failed → pending, re-link
uv run python manage.py link_entities --type musician # Link specific type only
```

Failed or skipped entities can also be retried from Django admin using the "Retry Wikidata linking" action.

The agent's tools — `search_wikidata`, `link_entity`, `mark_failed`, and `skip_entity` — are defined in [`episodes/agents/linker_tools.py`](../episodes/agents/linker_tools.py). The system prompt and signal handler are in [`episodes/agents/linker.py`](../episodes/agents/linker.py).

## How Scott Works

Scott is a strict RAG (Retrieval-Augmented Generation) agent:
@@ -159,7 +193,7 @@ Scott responds in the user's language, regardless of the source episode's langua

## Wikidata Cache

Wikidata API responses are cached to avoid repeated requests during entity resolution. Each unique entity name can trigger up to 11 API requests (1 search + up to 10 detail lookups), so caching is critical for performance and to avoid IP rate-limiting.
Wikidata API responses are cached to avoid repeated requests during entity linking. Each unique entity name can trigger up to 11 API requests (1 search + up to 10 detail lookups), so caching is critical for performance and to avoid IP rate-limiting.

| Setting | Default | Description |
|---------|---------|-------------|
84 changes: 84 additions & 0 deletions doc/features/2026-03-31-linking-agent.md
@@ -0,0 +1,84 @@
# Decouple Wikidata Linking into Background Linking Agent

**Date:** 2026-03-31

## Problem

The resolve pipeline step called the Wikidata API synchronously for every entity, causing timeouts on episodes with many entities. The Wikidata Q-ID is not needed for embedding or RAG queries, so linking can be deferred.

## Changes

### Entity model (`episodes/models.py`)

Added `LinkingStatus` choices (`pending`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. Data migration sets existing entities with `wikidata_id` to `linked`.
> **Copilot AI (Mar 31, 2026):**
> This feature doc lists `LinkingStatus` choices as (`pending`, `linked`, `skipped`, `failed`), but the model also includes `linking` (in-progress) and there's a migration altering the field choices accordingly. Please update the doc to include `linking` (or explain if it's intentionally internal-only).
>
> Suggested change:
> Added `LinkingStatus` choices (`pending`, `linking`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. `linking` represents an in-progress Wikidata linking operation. Data migration sets existing entities with `wikidata_id` to `linked`.

### Resolver simplified (`episodes/resolver.py`)

- Removed `_fetch_wikidata_candidates()` function entirely
- Removed Wikidata candidates section from `_build_system_prompt()`
- "No existing entities" branch now creates all entities directly without LLM call
- "Existing entities" branch still uses LLM for deduplication but without Wikidata API calls

### Linking agent (new: `episodes/agents/linker.py`, `linker_tools.py`, `linker_deps.py`)

Pydantic AI agent that processes pending entities in batches:
1. Skips entities whose type has no Wikidata class Q-ID
2. Searches Wikidata for candidates via `search_wikidata` tool
3. Uses LLM to disambiguate and link via `link_entity` / `mark_failed` / `skip_entity` tools
4. Queues follow-up tasks for remaining entities

Triggered by `step_completed` signal when RESOLVING finishes, connected in `apps.py:ready()`.

### Management command (`episodes/management/commands/link_entities.py`)

```bash
uv run python manage.py link_entities # Link all pending
uv run python manage.py link_entities --retry # Reset failed → pending
uv run python manage.py link_entities --type musician
```

### Admin (`episodes/admin.py`)

Added `linking_status` to `EntityAdmin` display/filters. "Retry Wikidata linking" action resets selected entities to pending.

### Configuration

New settings: `RAGTIME_LINKING_AGENT_ENABLED` (default: true), `RAGTIME_LINKING_AGENT_API_KEY`, `RAGTIME_LINKING_AGENT_MODEL` (default: `openai:gpt-4.1-mini`), `RAGTIME_LINKING_AGENT_BATCH_SIZE` (default: 50).

## Key Parameters

| Parameter | Value | Rationale |
|---|---|---|
| `RAGTIME_LINKING_AGENT_BATCH_SIZE` | 50 | Balances LLM context usage and throughput |
| `RAGTIME_LINKING_AGENT_ENABLED` | true | On by default since it's non-blocking |
| Pydantic AI `request_limit` | 50 | Higher than recovery agent (30) since linking processes more entities |

## Verification

1. `uv run python manage.py check` — no issues
2. `uv run python manage.py test episodes` — all tests pass
3. Process an episode — resolve step completes without Wikidata calls
4. `uv run python manage.py link_entities` — links pending entities
5. Admin shows `linking_status` column with correct values

## Files Modified

| File | Change |
|---|---|
| `episodes/models.py` | Add `LinkingStatus` choices and `linking_status` field |
| `episodes/resolver.py` | Remove Wikidata API calls, simplify both resolution branches |
| `episodes/agents/linker.py` | New — Pydantic AI linking agent, signal handler |
| `episodes/agents/linker_deps.py` | New — `LinkingDeps` and `LinkingAgentResult` |
| `episodes/agents/linker_tools.py` | New — agent tools wrapping wikidata.py |
| `episodes/apps.py` | Connect `step_completed` signal to linking handler |
| `episodes/admin.py` | Add linking_status display, filters, retry action |
| `episodes/management/commands/link_entities.py` | New — CLI command |
| `episodes/migrations/0020_add_entity_linking_status.py` | Schema migration |
| `episodes/migrations/0021_set_linking_status_for_existing.py` | Data migration |
| `episodes/tests/test_resolve.py` | Remove Wikidata patches, rework affected tests |
| `episodes/tests/test_linker.py` | New — linking agent tests |
| `.env.sample` | Add `RAGTIME_LINKING_AGENT_*` variables |
| `core/management/commands/_configure_helpers.py` | Add Linking Agent to wizard |
| `README.md` | Update pipeline table, add linking agent note |
| `doc/README.md` | Rewrite resolve step, add Linking Agent section |
| `CHANGELOG.md` | Add entry |