10 changes: 10 additions & 0 deletions .env.sample
@@ -82,6 +82,16 @@ RAGTIME_RECOVERY_AGENT_MODEL=
# Timeout in seconds for recovery agent attempts (default: 120)
RAGTIME_RECOVERY_AGENT_TIMEOUT=

# Linking agent — async Wikidata entity linking after pipeline resolve step
# Enable the linking agent (true/false, default: true)
RAGTIME_LINKING_AGENT_ENABLED=
# API key for the linking agent LLM provider
RAGTIME_LINKING_AGENT_API_KEY=
# Pydantic AI model string (default: openai:gpt-4.1-mini)
RAGTIME_LINKING_AGENT_MODEL=
# Batch size for linking agent (default: 50)
RAGTIME_LINKING_AGENT_BATCH_SIZE=

# Vector store backend (chroma, etc.)
RAGTIME_VECTOR_STORE=
# ChromaDB server host (default: localhost, omit for embedded/local mode)
8 changes: 8 additions & 0 deletions AGENTS.md
@@ -98,6 +98,14 @@ The commit for a given feature MUST contain the plan, the feature documentation,

## PR Creation

Before creating a PR, run the full test suite and verify it passes:

```bash
uv run python manage.py test --verbosity 2
```

Do not create the PR if tests are failing. Fix the failures first.

When creating PRs, ensure the PR includes: plan document, feature doc, session transcripts (planning + implementation), and changelog entry. Review the Documentation section above for full requirements before creating the PR.

## GitHub API (`gh`)
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## 2026-03-31

### Added

- Background linking agent — decouple Wikidata entity linking from the pipeline resolve step into an asynchronous Pydantic AI agent. The resolve step now performs pure LLM-based entity deduplication without external API calls, eliminating Wikidata timeouts. A linking agent runs in the background after resolve completes, enriching entities with Wikidata Q-IDs using LLM-based candidate disambiguation. Adds `linking_status` field to Entity model, `link_entities` management command, admin retry action, and `RAGTIME_LINKING_AGENT_*` configuration — [plan](doc/plans/2026-03-31-linking-agent.md), [feature](doc/features/2026-03-31-linking-agent.md), [planning session](doc/sessions/2026-03-31-linking-agent-planning-session.md), [implementation session](doc/sessions/2026-03-31-linking-agent-implementation-session.md)

## 2026-03-23

### Fixed
6 changes: 4 additions & 2 deletions README.md
@@ -25,7 +25,7 @@ RAGtime is a Django application for ingesting jazz-related podcast episodes. It

- 🎙️ **Episode Ingestion** — Add podcast episodes by URL. RAGtime scrapes metadata (title, description, date, image), downloads audio, and processes it through the pipeline.
- 📝 **Multilingual Transcription** — Transcribes episodes using configurable backends (Whisper API by default) with segment and word-level timestamps. Supports multiple languages (English, Spanish, German, Swedish, etc.).
- 🔍 **Entity Extraction** — Identifies jazz entities: musicians, musical groups, albums, music venues, recording sessions, record labels, years. Entities are resolved against existing records using LLM-based matching.
- 🔍 **Entity Extraction** — Identifies jazz entities: musicians, musical groups, albums, music venues, recording sessions, record labels, years. Entities are resolved against existing records using LLM-based matching. A background linking agent asynchronously enriches entities with Wikidata Q-IDs without blocking the pipeline.
- 📇 **Episode Indexing** — Splits transcripts into segments and generates multilingual embeddings stored in ChromaDB. Enables cross-language semantic search so Scott can retrieve relevant content regardless of the question's language.
- 🎷 **Scott — Your Jazz AI** — A conversational agent that answers questions strictly from ingested episode content. Scott responds in the user's language and provides references to specific episodes and timestamps. Responses stream in real-time.
- 📊 **AI Evaluation** — Measures pipeline and Scott quality using [RAGAS](https://docs.ragas.io/) (faithfulness, answer relevancy, context precision/recall) with scores tracked in [Langfuse](https://langfuse.com/docs/scores/model-based-evals/ragas).
@@ -65,12 +65,14 @@ Each step updates the episode's `status` field. A `post_save` signal dispatches
| 5 | 📋 Summarize | `summarizing` | LLM-generated episode summary |
| 6 | ✂️ Chunk | `chunking` | Split transcript into ~150-word chunks |
| 7 | 🔍 Extract | `extracting` | Named entity recognition per chunk |
| 8 | 🧩 Resolve | `resolving` | Entity linking and deduplication via Wikidata |
| 8 | 🧩 Resolve | `resolving` | LLM-based entity deduplication against existing DB records |
| 9 | 📐 Embed | `embedding` | Multilingual embeddings into ChromaDB |
| 10 | ✅ Ready | `ready` | Episode available for Scott to query |

_Steps 9–10 (Embed, Ready) are planned and not yet implemented._

After the resolve step completes, a **linking agent** runs asynchronously to enrich entities with [Wikidata](https://www.wikidata.org/) Q-IDs. This is not a pipeline step — it never blocks episode processing. See the [linking agent documentation](doc/README.md#linking-agent) for details.

See the [full pipeline documentation](doc/README.md) for per-step details, entity types, and the recovery layer.

## Documentation
17 changes: 17 additions & 0 deletions core/management/commands/_configure_helpers.py
@@ -124,6 +124,23 @@
},
],
},
{
"name": "Linking Agent",
"description": "Async Wikidata entity linking after pipeline resolve step",
"shareable": False,
"subsystems": [
{
"prefix": "RAGTIME_LINKING",
"label": "Linking Agent",
"fields": [
("AGENT_ENABLED", "true", False),
("AGENT_API_KEY", "", True),
("AGENT_MODEL", "openai:gpt-4.1-mini", False),
("AGENT_BATCH_SIZE", "50", False),
],
},
],
},
{
"name": "LLM Observability",
"description": "Langfuse tracing for LLM calls (optional)",
8 changes: 8 additions & 0 deletions core/tests/test_configure.py
@@ -230,6 +230,7 @@ def test_shared_mode_wizard(self, mock_input, mock_getpass):
"sk-newkey123", # Shared LLM API key
"sk-newkey123", # Transcription API key
"", # Recovery agent API key (keep default)
"", # Linking agent API key (keep default)
"", # Langfuse secret key (keep default)
"", # Langfuse public key (keep default)
]
@@ -255,6 +256,9 @@ def test_shared_mode_wizard(self, mock_input, mock_getpass):
"", # Recovery agent enabled (keep default)
"", # Recovery agent model (keep default)
"", # Recovery agent timeout (keep default)
"", # Linking agent enabled (keep default)
"", # Linking agent model (keep default)
"", # Linking agent batch size (keep default)
"", # Langfuse enabled (keep default)
"", # Langfuse host (keep default)
]
@@ -335,6 +339,7 @@ def test_rerun_preserves_non_ragtime_lines(self, mock_input, mock_getpass):
"sk-newkey123", # Shared LLM API key
"sk-newkey123", # Transcription API key
"", # Recovery agent API key (keep default)
"", # Linking agent API key (keep default)
"", # Langfuse secret key (keep default)
"", # Langfuse public key (keep default)
]
@@ -360,6 +365,9 @@ def test_rerun_preserves_non_ragtime_lines(self, mock_input, mock_getpass):
"", # Recovery agent enabled (keep default)
"", # Recovery agent model (keep default)
"", # Recovery agent timeout (keep default)
"", # Linking agent enabled (keep default)
"", # Linking agent model (keep default)
"", # Linking agent batch size (keep default)
"", # Langfuse enabled (keep default)
"", # Langfuse host (keep default)
]
66 changes: 50 additions & 16 deletions doc/README.md
@@ -71,30 +71,24 @@ New types can be added through Django admin; existing types can be deactivated (

#### 8. 🧩 Resolve entities (status: `resolving`)

**Entity Linking (NEL)** — maps extracted mentions to canonical entity records, deduplicating across chunks.
**Entity Resolution** — maps extracted mentions to canonical entity records, deduplicating across chunks.

Aggregates all extracted names across every chunk, then resolves **once per entity type** using LLM-based fuzzy matching against two sources:
Aggregates all extracted names across every chunk, then resolves **once per entity type** using LLM-based fuzzy matching against **existing DB records** — preventing duplicates when the same entity was seen in a previous episode.

1. **Existing DB records** — prevents duplicates when the same entity was seen in a previous episode.
2. **[Wikidata](https://www.wikidata.org/) candidates** — searches by name and type, presenting candidates (with Q-IDs and descriptions) to the LLM for confirmation. Matched entities receive a `wikidata_id` for canonical identification.
When no existing entities of a given type exist in the database, all extracted names are created as new `Entity` records directly (no LLM call needed — there is nothing to deduplicate against). When existing entities are present, the LLM resolves extracted names against them, considering spelling variants, language differences, and alternate names.

**Example** — continuing from the extract step, suppose the episode's chunks collectively mention "Bird", "Charlie Parker", "Yardbird", and "Dizzy Gillespie":

| Extracted mentions | Resolved to (canonical entity) | Wikidata ID |
|---|---|---|
| Bird, Charlie Parker, Yardbird | Charlie Parker | [Q103767](https://www.wikidata.org/wiki/Q103767) |
| Dizzy Gillespie | Dizzy Gillespie | [Q49575](https://www.wikidata.org/wiki/Q49575) |
| Extracted mentions | Resolved to (canonical entity) |
|---|---|
| Bird, Charlie Parker, Yardbird | Charlie Parker |
| Dizzy Gillespie | Dizzy Gillespie |

All three surface forms collapse into a single `Entity` record for Charlie Parker. An `EntityMention` is created for each (entity, chunk) pair, preserving which chunks mentioned the entity and the context of each mention.
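The collapse described above can be pictured as a small sketch (illustrative only — `collapse_mentions` and its inputs are hypothetical stand-ins, not the project's resolver code; the LLM's output is modeled as a mapping from each extracted surface form to its canonical name):

```python
# Illustrative sketch, not the actual resolver: collapse surface forms into
# canonical entities while keeping one mention per (chunk, entity) pair.
def collapse_mentions(llm_mapping, chunk_mentions):
    """llm_mapping: {surface form -> canonical name}, as resolved by the LLM.
    chunk_mentions: [(chunk_id, surface_form), ...] from the extract step.
    Returns (canonical_names, mentions)."""
    canonical_names = sorted(set(llm_mapping.values()))
    mentions = [(chunk_id, llm_mapping[surface])
                for chunk_id, surface in chunk_mentions]
    return canonical_names, mentions
```

With the mapping from the example table, "Bird", "Charlie Parker", and "Yardbird" all fold into one canonical entity while each chunk keeps its own mention row.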

This two-phase design (extract then resolve) is intentional: extraction is cheap and parallelizable per chunk, while resolution requires cross-chunk aggregation and knowledge base lookups. It also allows re-running resolution independently — e.g., after improving matching logic — without re-extracting.

Search Wikidata from the CLI with:
This two-phase design (extract then resolve) is intentional: extraction is cheap and parallelizable per chunk, while resolution requires cross-chunk aggregation. It also allows re-running resolution independently — e.g., after improving matching logic — without re-extracting.

```
uv run python manage.py lookup_entity "Miles Davis"
uv run python manage.py lookup_entity --type musician "Miles Davis"
```
Wikidata Q-IDs are assigned asynchronously by the [linking agent](#linking-agent) after the resolve step completes — see below.

#### 9. 📐 Embed (status: `embedding`) — *planned, not yet implemented*

@@ -144,6 +138,46 @@ The agent runs as a single [`agent.run()`](https://ai.pydantic.dev/agents/#runni

The chain order is configured in [`settings.py`](../ragtime/settings.py), and the maximum retry count (default: 5) is controlled by the `MAX_RECOVERY_ATTEMPTS` constant in [`episodes/recovery.py`](../episodes/recovery.py). The system prompt and tool registration are in [`episodes/agents/agent.py`](../episodes/agents/agent.py). The agent tools — `navigate_to_url`, `find_audio_links`, `click_element`, `download_file`, `translate_text`, `analyze_screenshot`, `click_at_coordinates`, `intercept_audio_requests`, and others — are defined in [`episodes/agents/tools.py`](../episodes/agents/tools.py).

### Linking Agent

After the resolve step completes, a **linking agent** runs asynchronously to enrich entities with [Wikidata](https://www.wikidata.org/) Q-IDs. This is **not** a pipeline step — it never blocks episode processing. The pipeline continues to the embed step immediately while the linking agent works in the background.

The linking agent is a [Pydantic AI](https://ai.pydantic.dev/) agent that:
1. Picks up entities with `linking_status = "pending"`
2. Searches Wikidata for candidates matching each entity's name and type
3. Uses LLM reasoning to disambiguate candidates (e.g., "Blue Note" — jazz club vs. record label)
4. Links entities to Q-IDs or marks them as failed/skipped
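The control flow in steps 1–4 can be sketched as follows (a hedged sketch, not the agent's actual code — `class_qid_for`, `search`, and `disambiguate` are hypothetical stand-ins for the real tools, injected so the flow itself is testable):

```python
# Hypothetical sketch of the linking decision flow described above.
def link_batch(entities, class_qid_for, search, disambiguate):
    """Return {entity name: (linking_status, qid_or_None)} for one batch."""
    outcomes = {}
    for entity in entities:
        if class_qid_for(entity["type"]) is None:
            # Entity type has no Wikidata class Q-ID -> nothing to search for.
            outcomes[entity["name"]] = ("skipped", None)
            continue
        candidates = search(entity["name"], entity["type"])
        qid = disambiguate(entity, candidates)  # LLM picks a candidate or None
        outcomes[entity["name"]] = ("linked", qid) if qid else ("failed", None)
    return outcomes
```

The injected callables mirror the agent's tools: `search` corresponds to `search_wikidata`, and the three outcomes map onto `link_entity`, `mark_failed`, and `skip_entity`.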

Each `Entity` record tracks its linking state via the `linking_status` field:

| Status | Meaning |
|--------|---------|
| `pending` | Not yet processed by the linking agent |
| `linked` | Successfully linked to a Wikidata Q-ID |
| `skipped` | Entity type has no Wikidata class Q-ID |
| `failed` | No suitable Wikidata match found |

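The lifecycle implied by this table can be summarized in a small sketch (an assumption drawn from this doc, not code from the project: `linked` is treated as terminal, while `failed` and `skipped` can be reset to `pending` by the retry action):

```python
# Sketch of the linking_status transitions implied by the table above.
ALLOWED_TRANSITIONS = {
    "pending": {"linked", "skipped", "failed"},  # outcomes of an agent run
    "failed": {"pending"},    # retry via admin action or `link_entities --retry`
    "skipped": {"pending"},   # same retry path as failed
    "linked": set(),          # terminal once a Q-ID is assigned
}

def can_transition(src, dst):
    return dst in ALLOWED_TRANSITIONS.get(src, set())
```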
> **Copilot AI — comment on lines +155 to +159 (Mar 31, 2026):**
> The linking agent documentation omits the `linking` status even though it's a real `Entity.linking_status` choice and will show up in admin/filters. Consider documenting what `linking` means (claimed/in-progress) and how to recover if entities get stuck in that state (e.g., reset to `pending` via admin/CLI).
>
> Suggested change — add a `linking` row to the table:
>
> | `linking` | Claimed by a worker and currently being processed by the linking agent (normally a short-lived, in-progress state) |
>
> …and note that the `linking` state is transient and usually clears quickly. If you see entities stuck in `linking` for an extended period (for example after a worker crash or deployment), you can safely reset their `linking_status` back to `pending` via the Django admin or a management command/CLI script; the linking agent will pick them up again on the next run.

The agent is triggered by the `step_completed` signal when the resolve step finishes. It processes all pending entities (not just those from the current episode), working in configurable batch sizes.
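The trigger wiring can be pictured as a dispatcher that ignores every step except resolve (a minimal stand-alone sketch — the dispatcher, handler name, and `"resolving"` literal are illustrative; the real wiring uses Django signals in `episodes/agents/linker.py`):

```python
# Minimal dispatcher sketch of the step_completed trigger described above.
_handlers = []

def connect(handler):
    _handlers.append(handler)

def step_completed(step):
    for handler in _handlers:
        handler(step)

queued = []

def maybe_start_linking(step):
    if step == "resolving":      # only react when the resolve step finishes
        queued.append("link pending entities")

connect(maybe_start_linking)
```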

The linking agent is **on by default**. Configure via the wizard or set these variables in `.env`:
```
RAGTIME_LINKING_AGENT_ENABLED=true
RAGTIME_LINKING_AGENT_API_KEY=sk-your-key
RAGTIME_LINKING_AGENT_MODEL=openai:gpt-4.1-mini
RAGTIME_LINKING_AGENT_BATCH_SIZE=50
```

Link entities manually from the CLI:
```bash
uv run python manage.py link_entities # Link all pending
uv run python manage.py link_entities --retry # Reset failed → pending, re-link
uv run python manage.py link_entities --type musician # Link specific type only
```

Failed or skipped entities can also be retried from Django admin using the "Retry Wikidata linking" action.

The agent's tools — `search_wikidata`, `link_entity`, `mark_failed`, and `skip_entity` — are defined in [`episodes/agents/linker_tools.py`](../episodes/agents/linker_tools.py). The system prompt and signal handler are in [`episodes/agents/linker.py`](../episodes/agents/linker.py).

## How Scott Works

Scott is a strict RAG (Retrieval-Augmented Generation) agent:
@@ -159,7 +193,7 @@ Scott responds in the user's language, regardless of the source episode's langua

## Wikidata Cache

Wikidata API responses are cached to avoid repeated requests during entity resolution. Each unique entity name can trigger up to 11 API requests (1 search + up to 10 detail lookups), so caching is critical for performance and to avoid IP rate-limiting.
Wikidata API responses are cached to avoid repeated requests during entity linking. Each unique entity name can trigger up to 11 API requests (1 search + up to 10 detail lookups), so caching is critical for performance and to avoid IP rate-limiting.

| Setting | Default | Description |
|---------|---------|-------------|
84 changes: 84 additions & 0 deletions doc/features/2026-03-31-linking-agent.md
@@ -0,0 +1,84 @@
# Decouple Wikidata Linking into Background Linking Agent

**Date:** 2026-03-31

## Problem

The resolve pipeline step called the Wikidata API synchronously for every entity, causing timeouts on episodes with many entities. The Wikidata Q-ID is not needed for embedding or RAG queries, so linking can be deferred.

## Changes

### Entity model (`episodes/models.py`)

Added `LinkingStatus` choices (`pending`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. Data migration sets existing entities with `wikidata_id` to `linked`.
> **Copilot AI (Mar 31, 2026):**
> This feature doc lists `LinkingStatus` choices as (`pending`, `linked`, `skipped`, `failed`), but the model also includes `linking` (in-progress) and there's a migration altering the field choices accordingly. Please update the doc to include `linking` (or explain if it's intentionally internal-only).
>
> Suggested change:
> Added `LinkingStatus` choices (`pending`, `linking`, `linked`, `skipped`, `failed`) and `linking_status` field to `Entity`. `linking` represents an in-progress Wikidata linking operation. Data migration sets existing entities with `wikidata_id` to `linked`.

### Resolver simplified (`episodes/resolver.py`)

- Removed `_fetch_wikidata_candidates()` function entirely
- Removed Wikidata candidates section from `_build_system_prompt()`
- "No existing entities" branch now creates all entities directly without LLM call
- "Existing entities" branch still uses LLM for deduplication but without Wikidata API calls

### Linking agent (new: `episodes/agents/linker.py`, `linker_tools.py`, `linker_deps.py`)

Pydantic AI agent that processes pending entities in batches:
1. Skips entities whose type has no Wikidata class Q-ID
2. Searches Wikidata for candidates via `search_wikidata` tool
3. Uses LLM to disambiguate and link via `link_entity` / `mark_failed` / `skip_entity` tools
4. Queues follow-up tasks for remaining entities

Triggered by `step_completed` signal when RESOLVING finishes, connected in `apps.py:ready()`.

### Management command (`episodes/management/commands/link_entities.py`)

```bash
uv run python manage.py link_entities # Link all pending
uv run python manage.py link_entities --retry # Reset failed → pending
uv run python manage.py link_entities --type musician
```

### Admin (`episodes/admin.py`)

Added `linking_status` to `EntityAdmin` display/filters. "Retry Wikidata linking" action resets selected entities to pending.

### Configuration

New settings: `RAGTIME_LINKING_AGENT_ENABLED` (default: true), `RAGTIME_LINKING_AGENT_API_KEY`, `RAGTIME_LINKING_AGENT_MODEL` (default: `openai:gpt-4.1-mini`), `RAGTIME_LINKING_AGENT_BATCH_SIZE` (default: 50).

## Key Parameters

| Parameter | Value | Rationale |
|---|---|---|
| `RAGTIME_LINKING_AGENT_BATCH_SIZE` | 50 | Balances LLM context usage and throughput |
| `RAGTIME_LINKING_AGENT_ENABLED` | true | On by default since it's non-blocking |
| Pydantic AI `request_limit` | 50 | Higher than recovery agent (30) since linking processes more entities |

## Verification

1. `uv run python manage.py check` — no issues
2. `uv run python manage.py test episodes` — all tests pass
3. Process an episode — resolve step completes without Wikidata calls
4. `uv run python manage.py link_entities` — links pending entities
5. Admin shows `linking_status` column with correct values

## Files Modified

| File | Change |
|---|---|
| `episodes/models.py` | Add `LinkingStatus` choices and `linking_status` field |
| `episodes/resolver.py` | Remove Wikidata API calls, simplify both resolution branches |
| `episodes/agents/linker.py` | New — Pydantic AI linking agent, signal handler |
| `episodes/agents/linker_deps.py` | New — `LinkingDeps` and `LinkingAgentResult` |
| `episodes/agents/linker_tools.py` | New — agent tools wrapping wikidata.py |
| `episodes/apps.py` | Connect `step_completed` signal to linking handler |
| `episodes/admin.py` | Add linking_status display, filters, retry action |
| `episodes/management/commands/link_entities.py` | New — CLI command |
| `episodes/migrations/0020_add_entity_linking_status.py` | Schema migration |
| `episodes/migrations/0021_set_linking_status_for_existing.py` | Data migration |
| `episodes/tests/test_resolve.py` | Remove Wikidata patches, rework affected tests |
| `episodes/tests/test_linker.py` | New — linking agent tests |
| `.env.sample` | Add `RAGTIME_LINKING_AGENT_*` variables |
| `core/management/commands/_configure_helpers.py` | Add Linking Agent to wizard |
| `README.md` | Update pipeline table, add linking agent note |
| `doc/README.md` | Rewrite resolve step, add Linking Agent section |
| `CHANGELOG.md` | Add entry |