feat(otel): real-time span export during eval execution (streaming observability) #305

## Context

AgentV's OTel exporter currently calls `exportResult()` **once per test case after it completes**. For long-running agentic evaluations (Copilot CLI, Claude SDK sessions that take minutes), the user sees nothing in Langfuse/Braintrust until the entire test case finishes.

The Braintrust `trace-claude-code` plugin exports spans in real-time via lifecycle hooks — each tool call and LLM response appears immediately. AgentV should do the same during eval runs.

## Current flow (batch)

```
Test case starts → agent runs (2-10 min) → test case ends → exportResult() → all spans sent at once
```

In `run-eval.ts:485-493`:
```typescript
onResult: async (result: EvaluationResult) => {
  // Only called AFTER the test case completes
  if (traceWriter && result.output && result.output.length > 0) {
    const traceRecord = buildTraceRecord(result.testId, result.output, { ... });
    await traceWriter.append(traceRecord);
  }
}
```

The OTel exporter is called in the same `onResult` callback — all spans for a test case are created and exported together after completion.

## Proposed flow (streaming)

```
Test case starts → root span created immediately
  → tool call completes → tool span exported in real-time
  → LLM response arrives → LLM span exported in real-time
  → another tool call → another span exported
Test case ends → root span finalized with score/verdict
```

## Implementation

### 1. Add streaming callbacks to provider interface

The provider (`ProviderAdapter`) needs to emit events as execution progresses, not just return a final result:

```typescript
// In packages/core/src/evaluation/providers/types.ts
export interface ProviderStreamCallbacks {
  onToolCallStart?: (toolName: string, input: unknown) => void;
  onToolCallEnd?: (toolName: string, input: unknown, output: unknown, durationMs: number) => void;
  onLlmCallEnd?: (model: string, tokenUsage: ProviderTokenUsage) => void;
}

export interface ProviderAdapter {
  query(input: string | Message[], options?: {
    stream?: ProviderStreamCallbacks;
  }): AsyncIterable<...> | Promise<...>;
}
```
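To make the callback contract concrete, here is a minimal sketch of a provider that honors `ProviderStreamCallbacks`. The types are redeclared locally so the snippet stands alone, and `ScriptedStep` and `runScriptedProvider` are hypothetical names used only for illustration, not part of AgentV.

```typescript
// Local stand-ins for the AgentV types, redeclared so the sketch is self-contained.
interface ProviderTokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface ProviderStreamCallbacks {
  onToolCallStart?: (toolName: string, input: unknown) => void;
  onToolCallEnd?: (toolName: string, input: unknown, output: unknown, durationMs: number) => void;
  onLlmCallEnd?: (model: string, tokenUsage: ProviderTokenUsage) => void;
}

// Hypothetical scripted step that the mock provider replays.
type ScriptedStep =
  | { kind: 'tool'; name: string; input: unknown; output: unknown; durationMs: number }
  | { kind: 'llm'; model: string; usage: ProviderTokenUsage };

// Replays the steps, firing whichever callbacks the caller supplied.
// Every callback is optional, so callers that pass nothing still work.
function runScriptedProvider(steps: ScriptedStep[], stream?: ProviderStreamCallbacks): string {
  for (const step of steps) {
    if (step.kind === 'tool') {
      stream?.onToolCallStart?.(step.name, step.input);
      stream?.onToolCallEnd?.(step.name, step.input, step.output, step.durationMs);
    } else {
      stream?.onLlmCallEnd?.(step.model, step.usage);
    }
  }
  return 'done';
}

// Usage: record the order in which events reach the caller.
const seen: string[] = [];
const finalOutput = runScriptedProvider(
  [
    { kind: 'tool', name: 'read_file', input: { path: 'a.txt' }, output: 'hello', durationMs: 12 },
    { kind: 'llm', model: 'mock-model', usage: { inputTokens: 10, outputTokens: 3 } },
  ],
  {
    onToolCallEnd: (name, _input, _output, ms) => seen.push(`tool:${name}:${ms}`),
    onLlmCallEnd: (model, usage) => seen.push(`llm:${model}:${usage.outputTokens}`),
  },
);
```

Keeping every callback optional means existing call sites that pass no `stream` option are unaffected.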

### 2. Create a streaming OTel observer

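```typescript
// In packages/core/src/observability/otel-exporter.ts
import { context, trace, type Span, type Tracer } from '@opentelemetry/api';

export class OtelStreamingObserver {
  private rootSpan?: Span;
  private tracer: Tracer;

  startEvalCase(testId: string, target: string): void {
    // Start the root span immediately so child spans can attach to it.
    // OTel processors export a span when it ends, so the root itself is sent in finalizeEvalCase().
    this.rootSpan = this.tracer.startSpan('agentv.eval', { ... });
    this.rootSpan.setAttribute('agentv.test_id', testId);
  }

  // SpanOptions has no `parent` field; the parent is passed as a Context.
  private parentContext() {
    return this.rootSpan ? trace.setSpan(context.active(), this.rootSpan) : context.active();
  }

  onToolCall(name: string, input: unknown, output: unknown, durationMs: number): void {
    // The call has already finished, so back-date the start to keep the real duration.
    const toolSpan = this.tracer.startSpan(
      `execute_tool ${name}`,
      { startTime: Date.now() - durationMs },
      this.parentContext(),
    );
    toolSpan.setAttribute('gen_ai.operation.name', 'execute_tool');
    toolSpan.setAttribute('gen_ai.tool.name', name);
    // ... set input/output attributes
    toolSpan.end(); // → SimpleSpanProcessor exports immediately
  }

  onLlmCall(model: string, tokenUsage: ProviderTokenUsage): void {
    const llmSpan = this.tracer.startSpan(`chat ${model}`, {}, this.parentContext());
    llmSpan.setAttribute('gen_ai.operation.name', 'chat');
    llmSpan.setAttribute('gen_ai.usage.input_tokens', tokenUsage.inputTokens);
    llmSpan.setAttribute('gen_ai.usage.output_tokens', tokenUsage.outputTokens);
    llmSpan.end(); // → exported immediately
  }

  finalizeEvalCase(score: number, verdict: string): void {
    this.rootSpan?.setAttribute('agentv.score', score);
    this.rootSpan?.setAttribute('agentv.verdict', verdict); // attribute name is an assumption
    this.rootSpan?.end(); // → final export
  }
}
```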
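Because `onToolCall` only learns about a call after it has finished, a span that starts and ends inside the handler would report a near-zero duration. One way to recover the real interval is to back-date the start by `durationMs`. The sketch below shows the arithmetic on its own, without the OTel SDK. The helper names are hypothetical, and `HrTime` is a local copy of the `[seconds, nanoseconds]` tuple shape that OTel uses for span timestamps.

```typescript
// Local copy of the [seconds, nanoseconds] tuple shape used for span timestamps.
type HrTime = [number, number];

// Converts epoch milliseconds to an HrTime tuple.
function msToHrTime(epochMs: number): HrTime {
  const seconds = Math.floor(epochMs / 1000);
  const nanos = Math.round((epochMs - seconds * 1000) * 1e6);
  return [seconds, nanos];
}

// Derives the span window for a call that ended at `endMs` and took `durationMs`.
function backdatedWindow(endMs: number, durationMs: number): { start: HrTime; end: HrTime } {
  return { start: msToHrTime(endMs - durationMs), end: msToHrTime(endMs) };
}

// Duration in nanoseconds between two HrTime values.
function hrTimeDurationNs(start: HrTime, end: HrTime): number {
  return (end[0] - start[0]) * 1e9 + (end[1] - start[1]);
}

// Usage: a 1.5 s tool call that finished at a fixed timestamp.
const spanWindow = backdatedWindow(1_700_000_000_250, 1_500);
const spanDurationNs = hrTimeDurationNs(spanWindow.start, spanWindow.end);
```

The nanosecond difference can be negative when the call crosses a second boundary, which is why the seconds and nanoseconds components are combined before comparing.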

### 3. Wire into orchestrator

In the orchestrator's eval execution loop, pass the streaming observer to the provider:

```typescript
// In packages/core/src/evaluation/orchestrator.ts
const observer = otelExporter ? new OtelStreamingObserver(otelExporter) : undefined;
observer?.startEvalCase(testCase.id, target.name);

const result = await provider.query(input, {
  stream: observer ? {
    onToolCallEnd: (name, input, output, ms) => observer.onToolCall(name, input, output, ms),
    onLlmCallEnd: (model, usage) => observer.onLlmCall(model, usage),
  } : undefined,
});

// After scoring
observer?.finalizeEvalCase(result.score, result.verdict);
```

### 4. Provider implementation

Each provider needs to call the stream callbacks. Most providers already capture tool calls in their event loops — they just need to invoke the callback:

**Claude SDK** (`claude.ts`): The SDK query async iterator yields `assistant` messages with tool calls. Call `onToolCallEnd` after each tool result message, `onLlmCallEnd` after each assistant message with usage.

**Copilot CLI** (`copilot-cli.ts`): ACP `sessionUpdate` events already fire for `tool_call`, `tool_call_update`, `usage_update`. Call callbacks from the existing event handlers.

**Copilot SDK** (`copilot-sdk.ts`): Similar to CLI — callback from SDK event stream.
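
For providers whose event stream reports the start and the completion of a tool call separately, the provider has to pair the two events to compute `durationMs`. A minimal sketch of that pairing follows. The `SessionUpdate` shape and its fields (`callId`, `status`, `output`) are assumptions made for illustration and do not reproduce the real ACP payloads. `ToolCallTracker` is likewise a hypothetical helper.

```typescript
// Hypothetical event shape, loosely modeled on the update kinds named above.
type SessionUpdate =
  | { kind: 'tool_call'; callId: string; toolName: string; input: unknown }
  | { kind: 'tool_call_update'; callId: string; status: 'in_progress' | 'completed'; output?: unknown };

type OnToolCallEnd = (toolName: string, input: unknown, output: unknown, durationMs: number) => void;

// Pairs start and completion events by call ID and reports the elapsed time.
class ToolCallTracker {
  private readonly pending = new Map<string, { toolName: string; input: unknown; startedAt: number }>();
  private readonly onToolCallEnd: OnToolCallEnd;
  private readonly now: () => number;

  // `now` is injectable so tests can use a fake clock.
  constructor(onToolCallEnd: OnToolCallEnd, now: () => number = () => Date.now()) {
    this.onToolCallEnd = onToolCallEnd;
    this.now = now;
  }

  handle(update: SessionUpdate): void {
    if (update.kind === 'tool_call') {
      this.pending.set(update.callId, {
        toolName: update.toolName,
        input: update.input,
        startedAt: this.now(),
      });
      return;
    }
    if (update.status !== 'completed') return; // ignore progress updates
    const started = this.pending.get(update.callId);
    if (!started) return; // completion without a matching start
    this.pending.delete(update.callId);
    this.onToolCallEnd(started.toolName, started.input, update.output, this.now() - started.startedAt);
  }
}

// Usage with a fake clock.
let clock = 1000;
const ended: string[] = [];
const tracker = new ToolCallTracker(
  (name, _input, output, ms) => ended.push(`${name}:${String(output)}:${ms}`),
  () => clock,
);
tracker.handle({ kind: 'tool_call', callId: 'c1', toolName: 'bash', input: 'ls' });
clock = 1250;
tracker.handle({ kind: 'tool_call_update', callId: 'c1', status: 'in_progress' });
clock = 1400;
tracker.handle({ kind: 'tool_call_update', callId: 'c1', status: 'completed', output: 'ok' });
```

Keying the pending map by call ID keeps overlapping tool calls from being attributed to each other.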

## Files to modify

1. **`packages/core/src/evaluation/providers/types.ts`** — Add `ProviderStreamCallbacks` interface
2. **`packages/core/src/observability/otel-exporter.ts`** — Add `OtelStreamingObserver` class
3. **`packages/core/src/evaluation/orchestrator.ts`** — Wire observer to provider
4. **`packages/core/src/evaluation/providers/claude.ts`** — Emit stream callbacks
5. **`packages/core/src/evaluation/providers/copilot-cli.ts`** — Emit stream callbacks
6. **`packages/core/src/evaluation/providers/copilot-sdk.ts`** — Emit stream callbacks
7. **Tests** — Verify spans appear during execution, not just after

## Acceptance criteria

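- [ ] Root eval span is started as soon as a test case starts, and child spans carry its trace and span IDs (OTel processors export a span on `end()`, so the root itself reaches Langfuse when it is finalized)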
- [ ] Tool spans appear in Langfuse as each tool call completes (not batched)
- [ ] LLM spans appear as each model response arrives
- [ ] Root span is finalized with score/verdict after evaluation completes
- [ ] Works with `--otel-backend langfuse` and `--otel-backend braintrust`
- [ ] Works with `--otel-file` (#304) — spans written as they complete
- [ ] Providers without streaming support (mock, cli) fall back to batch export (current behavior)
- [ ] No performance regression: callbacks return quickly and do not block provider execution
- [ ] Skills and docs updated
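
One possible way to implement the batch fallback criterion is to record whether any stream callback fired during `query()` and use the existing batch path when none did. The sketch below assumes that approach. `trackStreaming`, `exportMode`, and both providers are hypothetical names, not existing AgentV code.

```typescript
type Callback = (...args: any[]) => void;

// Wraps each callback so the caller can tell afterwards whether the provider streamed anything.
function trackStreaming<T extends Record<string, Callback>>(
  callbacks: T,
): { callbacks: T; didStream: () => boolean } {
  let fired = false;
  const wrapped: Record<string, Callback> = {};
  for (const [name, fn] of Object.entries(callbacks)) {
    wrapped[name] = (...args: unknown[]) => {
      fired = true;
      fn(...args);
    };
  }
  return { callbacks: wrapped as T, didStream: () => fired };
}

type Provider = (stream: { onToolCallEnd: Callback }) => string;

// Two hypothetical providers: one honors the callbacks, one ignores them.
const streamingProvider: Provider = (stream) => {
  stream.onToolCallEnd('bash');
  return 'output';
};
const batchOnlyProvider: Provider = (_stream) => 'output';

// Decides which export path applies after the provider has returned.
function exportMode(provider: Provider): 'streamed' | 'batch' {
  const tracked = trackStreaming({ onToolCallEnd: (_name: string) => {} });
  provider(tracked.callbacks);
  // If nothing fired, the caller would fall back to the existing exportResult() path.
  return tracked.didStream() ? 'streamed' : 'batch';
}
```

Detecting streaming at runtime avoids a per-provider capability flag, at the cost of deciding only after the provider has finished.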

## Dependencies

- Should use GenAI conventions from #298 for span/attribute names
- Compatible with `--otel-file` from #304 (file exporter also receives streaming spans)
- Part of tracking issue #303

## References

- Current batch export: `apps/cli/src/commands/eval/run-eval.ts:485-493`
- Orchestrator: `packages/core/src/evaluation/orchestrator.ts`
- OTel `SimpleSpanProcessor`: exports spans on `.end()` — no batching delay

## Testing Approach

### Unit Tests (InMemorySpanExporter)
```typescript
import { InMemorySpanExporter } from '@opentelemetry/sdk-trace-base';

const exporter = new InMemorySpanExporter();

// Test 1: Spans created during execution (not after)
// Use SimpleSpanProcessor (exports each span as soon as span.end() is called)
// Run eval with mock provider that has deliberate delay
// Assert: spans appear in exporter BEFORE onResult callback fires

// Test 2: Tool spans have correct timing
const spans = exporter.getFinishedSpans();
// 'execute_tool' matches the `execute_tool ${name}` span name used by the observer
const toolSpans = spans.filter(s => s.attributes['gen_ai.operation.name'] === 'execute_tool');
for (const span of toolSpans) {
  expect(span.startTime).toBeDefined();
  expect(span.endTime).toBeDefined();
  // startTime and endTime are HrTime tuples: [seconds, nanoseconds]
  const durationNs = (span.endTime[0] - span.startTime[0]) * 1e9 + (span.endTime[1] - span.startTime[1]);
  expect(durationNs).toBeGreaterThan(0);
}

// Test 3: LLM spans created per turn
const llmSpans = spans.filter(s => s.attributes['gen_ai.operation.name'] === 'chat');
expect(llmSpans.length).toBeGreaterThan(0);
```
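
Test 1 is described only in comments. A minimal sketch of the ordering assertion is shown here with a stand-in exporter, so it runs without the OTel SDK. `RecordingExporter` and `runMockEval` are hypothetical. In the real test, `InMemorySpanExporter.getFinishedSpans()` would take the place of `finished`.

```typescript
// Stand-in for InMemorySpanExporter: records span names in the order they finish.
class RecordingExporter {
  readonly finished: string[] = [];

  export(name: string): void {
    this.finished.push(name);
  }
}

// Hypothetical mock eval: two tool calls stream out, then the result callback fires.
function runMockEval(target: RecordingExporter, onResult: () => void): void {
  for (const tool of ['read_file', 'bash']) {
    target.export(`execute_tool ${tool}`); // exported as each call completes
  }
  onResult(); // the test case has completed
}

// Snapshot what the exporter holds at the moment onResult fires.
let spansSeenAtResult: string[] = [];
const recordingExporter = new RecordingExporter();
runMockEval(recordingExporter, () => {
  spansSeenAtResult = [...recordingExporter.finished];
});
```

Taking the snapshot inside the callback is what distinguishes streaming from batch export: under the current batch flow the snapshot would be empty.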

### Integration Test (Jaeger real-time)
```bash
docker run -d -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:latest

# Run a slow eval (mock provider with delays)
agentv eval examples/features/tool-trajectory-simple/evals/dataset.eval.yaml --target mock_agent --export-otel

# While running: open Jaeger http://localhost:16686
# Spans should appear in Jaeger BEFORE the eval completes
```

### What to Assert
- [ ] Spans exported in real-time (visible in Jaeger during execution, not after)
- [ ] Each provider turn creates spans as tool calls complete
- [ ] Token usage attributes populated per-span (not just on root)
- [ ] No regression: batch export at end still works for providers that don't support streaming
- [ ] SimpleSpanProcessor used (immediate export), not BatchSpanProcessor
