Migrate observability to OpenTelemetry and pipeline to LangGraph #88
Create episodes/telemetry.py with OTel TracerProvider, @trace_step and @trace_provider decorators, and record_llm_input/output span event helpers. Replace all @observe_step/@observe_provider usage across step files and providers. Replace Langfuse context propagation in recovery agent with OTel spans. Replace Langfuse screenshot media attachments with OTel span events. Replace RAGTIME_LANGFUSE_* env vars with RAGTIME_OTEL_* (exporter, endpoint, service name, headers). Update configure wizard for new OTel fields. Delete episodes/observability.py (256 lines of Langfuse-specific code). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create episodes/graph/ package with a LangGraph StateGraph that orchestrates the ingestion pipeline. The graph has 10 nodes (route, scrape, download, transcribe, summarize, chunk, extract, resolve, embed, recovery) with conditional edges for step skipping and recovery routing. An entry router inspects episode data to determine where to start, enabling resume-from-failure. Remove queue_next_step post_save signal handler and Django Q2 dependency. Replace async_task() in admin with threading.Thread + run_pipeline(). Add handle_step_failure_from_graph() to recovery.py for graph-initiated recovery. Remove Q_CLUSTER config and django_q from INSTALLED_APPS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
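The entry router described in this commit can be sketched in plain Python. This is only an illustration, not the code in `episodes/graph/edges.py`: the step names come from the commit message, but `SketchState`, the `OUTPUT_KEY` mapping, the `"done"` label, and the rule "start at the first step whose output is missing" are assumptions made for the example. The `start_from` override mirrors the parameter added in a later commit on this PR.

```python
from typing import Optional, TypedDict

# Step order taken from the node list in the commit message
# (the route and recovery nodes are not pipeline steps).
STEPS = ["scrape", "download", "transcribe", "summarize",
         "chunk", "extract", "resolve", "embed"]

# Hypothetical mapping from each step to the state key holding its output.
OUTPUT_KEY = {
    "scrape": "audio_url",
    "download": "audio_path",
    "transcribe": "transcript",
    "summarize": "summary",
    "chunk": "chunks",
    "extract": "entities",
    "resolve": "resolved_entities",
    "embed": "embeddings",
}


class SketchState(TypedDict, total=False):
    """Hypothetical state shape; the real EpisodeState may differ."""
    start_from: Optional[str]
    audio_url: str
    audio_path: str
    transcript: str
    summary: str
    chunks: list
    entities: list
    resolved_entities: list
    embeddings: list


def route_entry(state: SketchState) -> str:
    """Return the name of the node the pipeline should start from."""
    forced = state.get("start_from")
    if forced in STEPS:
        # An explicit request overrides data-based routing.
        return forced
    for step in STEPS:
        # Resume at the first step whose output is missing or empty.
        if not state.get(OUTPUT_KEY[step]):
            return step
    return "done"  # hypothetical label for "nothing left to run"
```

In a LangGraph graph, a function of this shape would typically be passed to `StateGraph.add_conditional_edges` so that its return value selects the next node.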
Remove ~100 async_task patches across 13 test files — no longer needed since Django Q2 signal dispatch was removed. Delete test_signals.py (tested old queue_next_step handler) and test_observability.py (tested Langfuse-specific code). Create test_telemetry.py for OTel telemetry tests. Update test_configure.py for OTel fields. Fix test_scraper.py and test_recovery.py to create ProcessingRun explicitly (previously created by signal handler). Update test_admin.py to mock threading instead of async_task. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add langgraph.json server config and episodes/graph/server.py (Django setup for LangGraph server). Run `langgraph dev` to start the server and connect via LangGraph Studio desktop app for graph visualization and execution inspection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update README.md: replace Langfuse with OpenTelemetry, Django Q2 with LangGraph in features, pipeline description, tech stack, and setup. Update doc/README.md: rewrite pipeline orchestration, recovery handler reference, and observability section (Langfuse → OTel with OTLP setup). Update CLAUDE.md: replace Langfuse/Django Q2 with OTel/LangGraph. Add changelog entry, plan doc, feature doc, and session transcripts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8d175d6 to 2e69b36
Add langgraph dev command to CLAUDE.md commands section. Update README Getting Started to include langgraph dev as optional step and clarify that the pipeline runs in-process via LangGraph (no separate worker). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR migrates RAGtime’s pipeline observability from Langfuse-specific instrumentation to OpenTelemetry, and replaces the Django Q2 + post_save signal-driven pipeline dispatch with a LangGraph StateGraph orchestration layer (including LangGraph Studio support).
Changes:
- Replace Langfuse observability (`episodes/observability.py`) with an OpenTelemetry-based module (`episodes/telemetry.py`) and update step/provider instrumentation accordingly.
- Replace Django Q2 pipeline dispatch (signals + `async_task`) with a LangGraph pipeline (`episodes/graph/*`) and update admin-triggered pipeline runs to execute via background threads.
- Update configuration/docs/tests to reflect new OTel env vars (`RAGTIME_OTEL_*`) and new orchestration model (LangGraph).
Reviewed changes
Copilot reviewed 49 out of 51 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| uv.lock | Updates resolved dependencies: removes django-q2/langfuse, adds langgraph + langchain-core deps and OTel libs. |
| pyproject.toml | Removes django-q2, adds langgraph, opentelemetry-* (and OTLP exporter in observability extra). |
| ragtime/settings.py | Removes django_q/Q_CLUSTER; replaces RAGTIME_LANGFUSE_* with RAGTIME_OTEL_*. |
| .env.sample | Replaces Langfuse env vars with OpenTelemetry env vars. |
| core/management/commands/_configure_helpers.py | Updates configure wizard sections/fields to OpenTelemetry settings. |
| core/tests/test_configure.py | Updates wizard tests to match new OTel prompts/inputs. |
| episodes/telemetry.py | New OTel tracing setup + decorators + span event helpers for LLM input/output. |
| episodes/observability.py | Deleted Langfuse-specific observability implementation. |
| episodes/providers/openai.py | Switches to plain openai.OpenAI and OTel span events/decorators. |
| episodes/scraper.py | Replaces @observe_step with @trace_step. |
| episodes/downloader.py | Not directly changed in diff shown, but now invoked by graph nodes instead of signals/Q2. |
| episodes/transcriber.py | Replaces @observe_step with @trace_step. |
| episodes/summarizer.py | Replaces @observe_step with @trace_step. |
| episodes/extractor.py | Replaces @observe_step with @trace_step. |
| episodes/resolver.py | Replaces @observe_step with @trace_step. |
| episodes/signals.py | Removes post_save pipeline dispatch; keeps only step_completed/step_failed. |
| episodes/apps.py | Stops auto-connecting recovery handler; calls telemetry setup on app ready. |
| episodes/recovery.py | Adds handle_step_failure_from_graph() for graph-initiated recovery. |
| episodes/admin.py | Replaces Q2 async_task usage with threading.Thread to run graph + recovery tasks. |
| episodes/agents/agent.py | Reworks recovery agent tracing to use OTel spans and flush OTel provider. |
| episodes/agents/tools.py | Records screenshots as OTel span events (replacing Langfuse media attachment). |
| episodes/graph/state.py | Defines EpisodeState schema for LangGraph state passing. |
| episodes/graph/edges.py | Entry routing + after-step conditional routing (including recovery routing). |
| episodes/graph/nodes.py | Wraps existing step functions as graph nodes + adds embed/recovery nodes. |
| episodes/graph/pipeline.py | Defines and compiles the ingestion StateGraph. |
| episodes/graph/run.py | Adds run_pipeline() entry point that creates runs and invokes the compiled graph. |
| episodes/graph/server.py | Ensures django.setup() is run before LangGraph Studio imports the graph. |
| langgraph.json | Adds LangGraph Studio/server configuration pointing to the ingestion graph. |
| episodes/tests/test_telemetry.py | New tests for OTel helper behaviors and calling conventions. |
| episodes/tests/test_observability.py | Deleted Langfuse observability tests. |
| episodes/tests/test_signals.py | Deleted tests for now-removed Q2/signal dispatch behavior. |
| episodes/tests/test_admin.py | Updates admin tests to expect thread-based background execution. |
| episodes/tests/test_scraper.py | Updates tests for new run creation assumptions and status-based behavior. |
| episodes/tests/test_transcribe.py | Removes Q2 signal patching; calls step function directly. |
| episodes/tests/test_summarize.py | Removes Q2 signal patching; calls step function directly. |
| episodes/tests/test_chunk.py | Removes Q2 signal patching; calls step function directly. |
| episodes/tests/test_extract.py | Removes Q2 signal patching; calls step function directly. |
| episodes/tests/test_resolve.py | Removes Q2 signal patching; calls step function directly. |
| episodes/tests/test_download.py | Removes Q2 signal patching; calls step function directly. |
| episodes/tests/test_events.py | Removes Q2 signal patching; validates event emission directly. |
| episodes/tests/test_models.py | Removes Q2 signal patching from model tests. |
| episodes/tests/test_recovery.py | Updates recovery tests to manually connect signals where needed. |
| episodes/tests/test_agent_resume.py | Removes Q2 patching assumptions around resume behaviors. |
| README.md | Updates documentation to reference OpenTelemetry + LangGraph; removes qcluster instructions. |
| doc/README.md | Updates pipeline + observability docs for LangGraph and OpenTelemetry. |
| CHANGELOG.md | Adds a 2026-04-13 entry describing the migrations. |
| AGENTS.md | Updates agent guidance to reflect LangGraph + OpenTelemetry architecture. |
| doc/plans/2026-04-13-otel-langgraph-migration.md | Adds migration plan documentation. |
| doc/features/2026-04-13-otel-langgraph-migration.md | Adds feature documentation for the migration. |
| doc/sessions/2026-04-13-otel-langgraph-migration-implementation-session.md | Adds implementation session transcript. |
| doc/sessions/2026-04-13-otel-langgraph-migration-planning-session.md | Adds planning session transcript. |
The langgraph CLI (langgraph dev) is a separate package from the langgraph library. Add langgraph-cli[inmem] as a studio optional dependency group. Update all docs to use `uv run langgraph dev` and reference `uv sync --extra studio` for installation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The langgraph CLI requires a 'dependencies' key in the config to know what to install. Use ["."] to reference the current project, which brings in all dependencies from pyproject.toml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point users to http://localhost:8000/admin/ after starting the server. Note that LangGraph Studio works locally without a LangSmith API key and the warning banner can be dismissed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Add start_from param to run_pipeline/route_entry so admin reprocess overrides data-based routing instead of silently skipping steps
2. Add _close_connections() for DB connection management in background threads (gated on non-main thread to avoid breaking tests)
3. Route ALL step failures to recovery node (not just scraping/downloading) so RecoveryAttempt records and human escalation work for every step
4. Doc now accurate after fix #3
5. Preserve original error_type/http_status/exception_class from PipelineEvent using dataclasses.replace() in handle_step_failure_from_graph
6. Fix telemetry.py module docstring re: OTel import when disabled
7. embed_node() fails fast instead of marking READY without embeddings
8. Add test_graph.py with 12 tests for route_entry, after_step, after_recovery, and compiled graph integration
9. Add comment to apps.py explaining signals kept as extension points
10. Fix _build_exporter() docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
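Item 5 relies on `dataclasses.replace()`, which returns a copy of a dataclass instance with selected fields overridden. A minimal sketch of that idea follows. `StepFailure` and `with_original_error` are hypothetical names invented for this example, and the event is modelled as a plain dict rather than the project's `PipelineEvent`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class StepFailure:
    """Hypothetical stand-in for the failure record handed to recovery."""
    episode_id: int
    step: str
    error_type: str = "unknown"
    http_status: Optional[int] = None
    exception_class: str = ""


def with_original_error(failure: StepFailure, event: dict) -> StepFailure:
    """Copy error details recorded on the event onto a new failure record."""
    overrides = {
        name: event[name]
        for name in ("error_type", "http_status", "exception_class")
        if event.get(name) is not None
    }
    # replace() builds a new instance; the original is left untouched.
    return replace(failure, **overrides)
```

Because the record is frozen and `replace()` copies, the generic failure built by the graph is never mutated; fields missing from the event simply keep their defaults.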
High: this change appears to remove the primary ingestion trigger when a new episode is created.
Impact: creating an The focused Django test slice passed, but this behavior is no longer covered because

Medium: admin-triggered pipeline/recovery work is now launched via raw daemon threads, which makes the "queued" behavior non-durable. The reprocess action starts
Impact: if the Django process reloads, exits, or the worker is recycled after the admin response returns, daemon threads can be terminated immediately and the background work is silently lost. That is a reliability regression from the previous task-queue model, because the UI reports success ( The current tests only mock
Generated by `langgraph dev` and should not be tracked. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add save_model() to EpisodeAdmin to auto-start the ingestion pipeline when a new episode is created, restoring the behavior removed with the Django Q2 to LangGraph migration. - Remove daemon=True from background threads (reprocess, recovery retry, and new auto-start) so work completes during graceful process shutdown instead of being silently killed. - Add test for auto-ingestion on episode create. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
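The daemon change matters because, at normal interpreter shutdown, Python waits for non-daemon threads to finish but abandons daemon threads. A minimal sketch of starting background work on a non-daemon thread follows; `start_background` is a hypothetical helper, not a function from this PR.

```python
import threading


def start_background(target, *args):
    """Run target(*args) on a non-daemon thread and return the thread."""
    # daemon=False is spelled out for clarity: the interpreter will wait
    # for this thread at normal shutdown instead of killing it mid-work.
    thread = threading.Thread(target=target, args=args, daemon=False)
    thread.start()
    return thread
```

This does not protect against hard kills or crashes; it only keeps the work alive through a graceful exit.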
Fixed in 9134c90.

Fixed in 9134c90. Removed This doesn't provide full persistence across hard kills or crashes; that would require re-introducing a task queue. But it eliminates the most common failure mode (work lost on reload) while keeping the implementation simple.
Extract Jaeger setup from inline step to its own subsection with Docker command and .env config. Add comprehensive Langfuse subsection covering self-hosted setup (with port conflict tip), API key generation, base64 auth encoding, and Langfuse Cloud regional endpoints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
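The base64 auth encoding mentioned here can be sketched as follows, assuming the backend accepts HTTP Basic auth built from a public/secret key pair and that headers are supplied in the `key=value` form commonly used for OTLP header settings. `basic_auth_header` is a hypothetical helper; the exact env var that receives the value is not shown in this PR text and is left out.

```python
import base64


def basic_auth_header(public_key: str, secret_key: str) -> str:
    """Build an 'Authorization=Basic <token>' header entry from a key pair."""
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"
```

The returned string is what would be pasted into the OTLP headers setting in `.env`.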
Remove RAGTIME_OTEL_ENABLED flag — tracing now activates automatically when RAGTIME_OTEL_EXPORTER is set to 'otlp' or 'console'. When unset or 'none', the OTel API provides a no-op tracer with zero overhead. Move opentelemetry-exporter-otlp-proto-http from optional to main dependencies and remove the 'observability' extras group entirely. For a RAG app chaining multiple LLM calls, observability is essential infrastructure, not an optional add-on. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
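The activation rule in this commit (tracing is on only when `RAGTIME_OTEL_EXPORTER` is `otlp` or `console`) can be sketched as a small pure function. `selected_exporter` is a hypothetical name, and the whitespace and case normalisation is an assumption rather than something the commit states.

```python
import os
from typing import Mapping, Optional

# Values that turn tracing on, per the commit message.
ACTIVE_EXPORTERS = {"otlp", "console"}


def selected_exporter(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the exporter to configure, or None to leave tracing off."""
    value = env.get("RAGTIME_OTEL_EXPORTER", "").strip().lower()
    # Unset, 'none', or any unknown value leaves the no-op tracer in place.
    return value if value in ACTIVE_EXPORTERS else None
```

When this returns `None`, no tracer provider needs to be installed, which is what lets the OTel API fall back to its no-op tracer.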
High: the auto-start-on-create fix still has a transaction race. In

```python
threading.Thread(
    target=_run_pipeline_task,
    args=(obj.pk,),
).start()
```

This thread uses a separate DB connection, so it can run before the admin request's transaction commits and fail to see the newly created The safe pattern here is to schedule the thread from The new test does not catch this because it mocks
Use transaction.on_commit() to defer the background thread until the Episode row is committed and visible to other DB connections. Without this, the worker thread could run before the admin request's transaction commits and fail with Episode.DoesNotExist. Update test to use captureOnCommitCallbacks(execute=True) to verify the on_commit callback fires correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed in b16c617. Wrapped the background thread start in
Summary
- `RAGTIME_OTEL_*` env vars replace `RAGTIME_LANGFUSE_*`.
- Replace Django Q2 signal dispatch with a LangGraph `StateGraph` that supports autonomous step skipping, recovery routing, and resume-from-failure. Remove `django-q2` dependency entirely.
- Add `langgraph.json` and server module for local graph visualization via Studio desktop app.

Changes

Phase 1: OpenTelemetry
- Add `episodes/telemetry.py` (`@trace_step`, `@trace_provider`, `record_llm_input/output`)
- Delete `episodes/observability.py` (256 lines of Langfuse-specific code)

Phase 2: LangGraph Pipeline
- Add `episodes/graph/` package (state, nodes, edges, pipeline, run; 7 new files)
- Add `handle_step_failure_from_graph()` for graph-initiated recovery
- Remove `queue_next_step` signal handler, Django Q2, `Q_CLUSTER` config
- Replace `async_task()` in admin with `threading.Thread` + `run_pipeline()`

Phase 3: LangGraph Studio
- Add `langgraph.json` + `episodes/graph/server.py` for Studio integration

Documentation

Test plan
- `uv run python manage.py test`: all 211 tests pass (9 skipped for optional deps)
- `uv run python manage.py check`: no issues
- Set `RAGTIME_OTEL_EXPORTER=console` and verify spans in terminal
- `langgraph dev` starts and Studio shows the graph

🤖 Generated with Claude Code