feat(otel): real-time span export during eval execution (streaming observability) #305

@christso

Context

AgentV's OTel exporter currently calls exportResult() once per test case after it completes. For long-running agentic evaluations (Copilot CLI, Claude SDK sessions that take minutes), the user sees nothing in Langfuse/Braintrust until the entire test case finishes.

The Braintrust trace-claude-code plugin exports spans in real-time via lifecycle hooks — each tool call and LLM response appears immediately. AgentV should do the same during eval runs.

Current flow (batch)

Test case starts → agent runs (2-10 min) → test case ends → exportResult() → all spans sent at once

In run-eval.ts:485-493:

onResult: async (result: EvaluationResult) => {
  // Only called AFTER the test case completes
  if (traceWriter && result.output && result.output.length > 0) {
    const traceRecord = buildTraceRecord(result.testId, result.output, { ... });
    await traceWriter.append(traceRecord);
  }
}

The OTel exporter is called in the same onResult callback — all spans for a test case are created and exported together after completion.

Proposed flow (streaming)

Test case starts → root span created immediately
  → tool call completes → tool span exported in real-time
  → LLM response arrives → LLM span exported in real-time
  → another tool call → another span exported
Test case ends → root span finalized with score/verdict

Implementation

1. Add streaming callbacks to provider interface

The provider (ProviderAdapter) needs to emit events as execution progresses, not just return a final result:

// In packages/core/src/evaluation/providers/types.ts
export interface ProviderStreamCallbacks {
  onToolCallStart?: (toolName: string, input: unknown) => void;
  onToolCallEnd?: (toolName: string, input: unknown, output: unknown, durationMs: number) => void;
  onLlmCallEnd?: (model: string, tokenUsage: ProviderTokenUsage) => void;
}

export interface ProviderAdapter {
  query(input: string | Message[], options?: {
    stream?: ProviderStreamCallbacks;
  }): AsyncIterable<...> | Promise<...>;
}
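
As a rough illustration of how a provider might drive these callbacks, the sketch below wraps a single tool invocation and reports it through the optional hooks. The helper `runToolWithCallbacks` and its injectable `now` clock are hypothetical and not part of AgentV; the interfaces are local copies of the ones above so the snippet compiles on its own.

```typescript
// Local copies of the issue's types so the sketch compiles on its own.
interface ProviderTokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface ProviderStreamCallbacks {
  onToolCallStart?: (toolName: string, input: unknown) => void;
  onToolCallEnd?: (toolName: string, input: unknown, output: unknown, durationMs: number) => void;
  onLlmCallEnd?: (model: string, tokenUsage: ProviderTokenUsage) => void;
}

// Hypothetical helper (not part of AgentV): runs one tool invocation and
// reports it through the optional callbacks. `now` is injectable for tests.
function runToolWithCallbacks<T>(
  stream: ProviderStreamCallbacks | undefined,
  toolName: string,
  input: unknown,
  invoke: () => T,
  now: () => number = Date.now,
): T {
  stream?.onToolCallStart?.(toolName, input);
  const startedAtMs = now();
  const output = invoke();
  stream?.onToolCallEnd?.(toolName, input, output, now() - startedAtMs);
  return output;
}
```

Because every hook is optional and `stream` itself may be undefined, providers that receive no callbacks behave exactly as before.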

2. Create a streaming OTel observer

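// In packages/core/src/observability/otel-exporter.ts
import { context, trace, type Span, type Tracer } from '@opentelemetry/api';

export class OtelStreamingObserver {
  private rootSpan: Span;
  private tracer: Tracer;
  // Constructor omitted in this sketch; it receives the exporter (see step 3).

  startEvalCase(testId: string, target: string): void {
    // Start the root span immediately so child spans can attach to it.
    // Span processors export a span when it ends, so the root itself reaches
    // the backend in finalizeEvalCase(); child spans arrive before it.
    this.rootSpan = this.tracer.startSpan('agentv.eval', { ... });
    this.rootSpan.setAttribute('agentv.test_id', testId);
  }

  // Context that makes new spans children of the root span.
  // (startSpan has no `parent` option; the parent is passed via context.)
  private parentContext() {
    return trace.setSpan(context.active(), this.rootSpan);
  }

  onToolCall(name: string, input: unknown, output: unknown, durationMs: number): void {
    // Create and immediately end (export) a tool span. The span is created
    // after the call finished, so backdate its start by the measured duration.
    const toolSpan = this.tracer.startSpan(
      `execute_tool ${name}`,
      { startTime: Date.now() - durationMs },
      this.parentContext(),
    );
    toolSpan.setAttribute('gen_ai.tool.name', name);
    // ... set input/output attributes
    toolSpan.end(); // → SimpleSpanProcessor exports immediately
  }

  onLlmCall(model: string, tokenUsage: ProviderTokenUsage): void {
    const llmSpan = this.tracer.startSpan(`chat ${model}`, {}, this.parentContext());
    llmSpan.setAttribute('gen_ai.usage.input_tokens', tokenUsage.inputTokens);
    llmSpan.setAttribute('gen_ai.usage.output_tokens', tokenUsage.outputTokens);
    llmSpan.end(); // → exported immediately
  }

  finalizeEvalCase(score: number, verdict: string): void {
    this.rootSpan.setAttribute('agentv.score', score);
    this.rootSpan.setAttribute('agentv.verdict', verdict); // attribute name assumed
    this.rootSpan.end(); // → final export
  }
}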

3. Wire into orchestrator

In the orchestrator's eval execution loop, pass the streaming observer to the provider:

// In packages/core/src/evaluation/orchestrator.ts
const observer = otelExporter ? new OtelStreamingObserver(otelExporter) : undefined;
observer?.startEvalCase(testCase.id, target.name);

const result = await provider.query(input, {
  stream: observer ? {
    // Parameter names avoid shadowing the outer `input`.
    onToolCallEnd: (name, toolInput, toolOutput, ms) => observer.onToolCall(name, toolInput, toolOutput, ms),
    onLlmCallEnd: (model, usage) => observer.onLlmCall(model, usage),
  } : undefined,
});

// After scoring
observer?.finalizeEvalCase(result.score, result.verdict);
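
One gap in the wiring above is the failure path: if the provider or the scoring step throws, the root span is never ended and so never exported. A minimal sketch of one way to guard against that follows. The `runObserved` wrapper, the `EvalObserver` and `ScoredResult` shapes, and the placeholder score `0` and verdict `'error'` are all assumptions made for illustration. It is synchronous for brevity; the real orchestrator would await the provider.

```typescript
// Minimal shapes assumed for this sketch; the real types live in AgentV.
interface EvalObserver {
  startEvalCase(testId: string, target: string): void;
  finalizeEvalCase(score: number, verdict: string): void;
}

interface ScoredResult {
  score: number;
  verdict: string;
}

// Hypothetical wrapper: guarantees the root span is finalized even when the
// provider or the scoring step throws.
function runObserved(
  observer: EvalObserver | undefined,
  testId: string,
  target: string,
  run: () => ScoredResult,
): ScoredResult {
  observer?.startEvalCase(testId, target);
  let result: ScoredResult | undefined;
  try {
    result = run();
    return result;
  } finally {
    // Score 0 and verdict 'error' are placeholder values chosen for this sketch.
    observer?.finalizeEvalCase(result?.score ?? 0, result?.verdict ?? 'error');
  }
}
```

The original exception still propagates to the caller; the `finally` block only makes sure the trace is closed.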

4. Provider implementation

Each provider needs to call the stream callbacks. Most providers already capture tool calls in their event loops — they just need to invoke the callback:

Claude SDK (claude.ts): The SDK query async iterator yields assistant messages with tool calls. Call onToolCallEnd after each tool result message, onLlmCallEnd after each assistant message with usage.

Copilot CLI (copilot-cli.ts): ACP sessionUpdate events already fire for tool_call, tool_call_update, usage_update. Call callbacks from the existing event handlers.

Copilot SDK (copilot-sdk.ts): Similar to the CLI provider. Invoke the callbacks from the SDK event stream handlers.
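
The event-driven providers need to pair a "tool started" event with its later "tool finished" event to compute a duration. The sketch below shows one way to do that with a map keyed by call id. The `ToolEvent` shape is a simplified, hypothetical stand-in; the real ACP and SDK payloads differ and are not reproduced here. `ToolCallTracker` is likewise an illustrative name.

```typescript
// Hypothetical, simplified event shape for illustration only.
type ToolEvent =
  | { kind: 'tool_call'; id: string; toolName: string; input: unknown; atMs: number }
  | { kind: 'tool_call_update'; id: string; output: unknown; atMs: number };

interface ToolStreamCallbacks {
  onToolCallStart?: (toolName: string, input: unknown) => void;
  onToolCallEnd?: (toolName: string, input: unknown, output: unknown, durationMs: number) => void;
}

interface PendingCall {
  toolName: string;
  input: unknown;
  startedAtMs: number;
}

class ToolCallTracker {
  private readonly pending = new Map<string, PendingCall>();
  private readonly callbacks: ToolStreamCallbacks;

  constructor(callbacks: ToolStreamCallbacks) {
    this.callbacks = callbacks;
  }

  handle(event: ToolEvent): void {
    if (event.kind === 'tool_call') {
      // Remember when the call started so the duration can be computed later.
      this.pending.set(event.id, {
        toolName: event.toolName,
        input: event.input,
        startedAtMs: event.atMs,
      });
      this.callbacks.onToolCallStart?.(event.toolName, event.input);
      return;
    }
    const started = this.pending.get(event.id);
    if (!started) return; // completion for an unknown call: nothing to report
    this.pending.delete(event.id); // a second completion for the same id is ignored
    this.callbacks.onToolCallEnd?.(
      started.toolName,
      started.input,
      event.output,
      event.atMs - started.startedAtMs,
    );
  }
}
```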

Files to modify

  1. packages/core/src/evaluation/providers/types.ts — Add ProviderStreamCallbacks interface
  2. packages/core/src/observability/otel-exporter.ts — Add OtelStreamingObserver class
  3. packages/core/src/evaluation/orchestrator.ts — Wire observer to provider
  4. packages/core/src/evaluation/providers/claude.ts — Emit stream callbacks
  5. packages/core/src/evaluation/providers/copilot-cli.ts — Emit stream callbacks
  6. packages/core/src/evaluation/providers/copilot-sdk.ts — Emit stream callbacks
  7. Tests — Verify spans appear during execution, not just after

Acceptance criteria

  • Root eval span appears in Langfuse immediately when a test case starts
  • Tool spans appear in Langfuse as each tool call completes (not batched)
  • LLM spans appear as each model response arrives
  • Root span is finalized with score/verdict after evaluation completes
  • Works with --otel-backend langfuse and --otel-backend braintrust
  • Works with --otel-file (#304, "feat(otel): replace proprietary trace JSONL with OTLP JSON file export"): spans are written as they complete
  • Providers without streaming support (mock, cli) fall back to batch export (current behavior)
  • No performance regression — callbacks are async, non-blocking
  • Skills and docs updated
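
The batch fallback criterion could be met along these lines: count the spans that were exported live, and only send the batch at the end if that count is zero. This is a sketch under assumptions, not the AgentV implementation; `StreamOrBatchExporter` and `SpanRecord` are hypothetical names.

```typescript
interface SpanRecord {
  name: string;
}

// Hypothetical sketch: exports spans as they arrive, and falls back to a
// single batch at the end when the provider never streamed anything.
class StreamOrBatchExporter {
  readonly sent: SpanRecord[] = [];
  private streamedCount = 0;

  // Called from the streaming callbacks while the test case is running.
  exportStreaming(span: SpanRecord): void {
    this.streamedCount += 1;
    this.sent.push(span);
  }

  // Called once after the test case completes, with every span from the result.
  finalize(allSpans: SpanRecord[]): void {
    if (this.streamedCount > 0) return; // already exported live; avoid duplicates
    this.sent.push(...allSpans);
  }
}
```

Keying the decision on whether anything was streamed, rather than on the provider type, means a provider gains real-time export as soon as it starts invoking the callbacks.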

Dependencies

References

  • Current batch export: apps/cli/src/commands/eval/run-eval.ts:485-493
  • Orchestrator: packages/core/src/evaluation/orchestrator.ts
  • OTel SimpleSpanProcessor: exports spans on .end() — no batching delay
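
Export-on-end has one consequence worth keeping in mind: the root span is started first but exported last, because it ends last. The tiny model below illustrates the resulting order. It is not the OpenTelemetry SDK; `ExportOnEndTracer` and the span names passed to it are made up for this sketch.

```typescript
// A tiny model of export-on-end semantics. This is NOT the OpenTelemetry SDK.
interface RecordedSpan {
  name: string;
  parentName: string | undefined;
}

class ExportOnEndTracer {
  readonly exported: RecordedSpan[] = [];

  startSpan(name: string, parentName?: string): { end: () => void } {
    let ended = false;
    return {
      end: () => {
        if (ended) return; // ending a span twice must not export it twice
        ended = true;
        this.exported.push({ name, parentName });
      },
    };
  }
}

function demoExportOrder(): string[] {
  const tracer = new ExportOnEndTracer();
  const root = tracer.startSpan('agentv.eval'); // started first, exported last
  tracer.startSpan('execute_tool read_file', 'agentv.eval').end();
  tracer.startSpan('chat example-model', 'agentv.eval').end();
  root.end();
  return tracer.exported.map((span) => span.name);
}
```

In this model the tool and chat spans reach the exporter before `agentv.eval` does, so a backend receives children whose parent has not arrived yet until the test case is finalized.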

Testing Approach

Unit Tests (InMemorySpanExporter)

import { InMemorySpanExporter } from '@opentelemetry/sdk-trace-base';

const exporter = new InMemorySpanExporter();

// Test 1: Spans created during execution (not after)
// Use SimpleSpanProcessor (synchronous export on span.end())
// Run eval with mock provider that has deliberate delay
// Assert: spans appear in exporter BEFORE onResult callback fires

// Test 2: Tool spans have correct timing
const spans = exporter.getFinishedSpans();
const toolSpans = spans.filter(s => s.attributes['gen_ai.operation.name'] === 'tool');
for (const span of toolSpans) {
  expect(span.startTime).toBeDefined();
  expect(span.endTime).toBeDefined();
  // Duration should be positive
  const durationNs = Number(span.endTime[0] - span.startTime[0]) * 1e9 + (span.endTime[1] - span.startTime[1]);
  expect(durationNs).toBeGreaterThan(0);
}

// Test 3: LLM spans created per turn
const llmSpans = spans.filter(s => s.attributes['gen_ai.operation.name'] === 'chat');
expect(llmSpans.length).toBeGreaterThan(0);
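
Test 1 above is described only in comments. Its core assertion can be sketched without any OTel dependency: record how many spans the exporter holds at the moment onResult fires. `FakeExporter`, `mockProviderQuery`, and `spansExportedBeforeResult` are hypothetical stand-ins, and the mock provider is synchronous rather than delayed.

```typescript
// Stand-ins for the real exporter and provider; all names here are hypothetical.
class FakeExporter {
  readonly finished: string[] = [];

  exportSpan(name: string): void {
    this.finished.push(name);
  }
}

// Simulates a provider that reports two tool calls before returning its answer.
function mockProviderQuery(onToolCallEnd: (toolName: string) => void): string {
  onToolCallEnd('search');
  onToolCallEnd('read_file');
  return 'final answer';
}

// Returns how many spans had been exported at the moment onResult fired.
function spansExportedBeforeResult(): number {
  const exporter = new FakeExporter();
  let countAtResult = -1;
  const onResult = (_output: string): void => {
    countAtResult = exporter.finished.length;
  };
  const output = mockProviderQuery((toolName) => {
    exporter.exportSpan(`execute_tool ${toolName}`);
  });
  onResult(output);
  return countAtResult;
}
```

With streaming export the count is already at two when onResult runs; with the current batch flow it would still be zero at that point.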

Integration Test (Jaeger real-time)

docker run -d -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:latest

# Run a slow eval (mock provider with delays)
agentv eval examples/features/tool-trajectory-simple/evals/dataset.eval.yaml --target mock_agent --export-otel

# While running: open Jaeger http://localhost:16686
# Spans should appear in Jaeger BEFORE the eval completes

What to Assert

  • Spans exported in real-time (visible in Jaeger during execution, not after)
  • Each provider turn creates spans as tool calls complete
  • Token usage attributes populated per-span (not just on root)
  • No regression: batch export at end still works for providers that don't support streaming
  • SimpleSpanProcessor used (immediate export), not BatchSpanProcessor
