tracking: OTel observability — GenAI conventions, richer traces, session tracing #303

## Overview

Parent issue tracking eight OTel observability and eval infrastructure improvements identified from a deep analysis of [braintrustdata/braintrust-claude-plugin](https://github.com/braintrustdata/braintrust-claude-plugin). AgentV already has a working OTel exporter (#277). The issues below aim to make it production-grade and interoperable.

## Architecture Alignment Review

Each issue was reviewed against AgentV's 5 design principles (CLAUDE.md). Key finding: **#300 should be implemented as a plugin, not core code**, per Principle 1 (Lightweight Core, Plugin Extensibility).

| Issue | Title | Verdict | Adjustment |
|-------|-------|---------|------------|
| #298 | Adopt OTel GenAI semantic conventions | **ALIGNED** | None — standards alignment (P3) |
| #299 | Per-span token usage in OTel export | **ALIGNED** | None — universal primitive (P2) |
| #300 | Claude Code session tracing plugin | **NEEDS ADJUSTMENT** | Implement as plugin, not core (P1) |
| #301 | Trace composition / parent span linking | **ALIGNED** | None — W3C standard (P3) |
| #302 | Turn-level span grouping | **ALIGNED** | Opt-in via `--otel-group-turns` flag |
| #304 | Replace proprietary trace JSONL with OTLP JSON file export | **ALIGNED** | Removes dead code, consolidates to one schema (P3, P1) |
| #305 | Real-time span export during eval execution | **ALIGNED** | Core streaming primitive needed by all providers (P2) |
| #306 | Lazy file-backed output for code judge payloads | **ALIGNED** | Performance optimization for large Message[] payloads (P2) |

### Principle Analysis

**P1 (Lightweight Core)**: #300 is Claude Code-specific → plugin. #304 *reduces* core surface by deleting `TraceWriter`. Everything else is core OTel infrastructure.

**P2 (Primitives Only)**: #298, #299, and #305 expose existing data through standard interfaces: universal, stateless, and needed by the majority. #306 optimizes how existing data is passed to judges.

**P3 (Industry Standards)**: #298 = OTel GenAI conventions. #301 = W3C Trace Context. #304 = OTLP JSON spec.

## Phased Delivery

### Phase 1: Core attribute improvements (~3-4 days)
- [ ] #298 — OTel GenAI semantic conventions (**gates #299, #300, #302, #304, and #305**)
- [ ] #299 — Per-span token usage metrics
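A minimal sketch of what Phase 1 could look like at the attribute level. The `gen_ai.*` keys are taken from the OTel GenAI semantic conventions, which are still evolving, so they should be checked against whichever version #298 adopts. `LlmCallRecord` and `toGenAiAttributes` are hypothetical names for illustration, not existing AgentV types.

```typescript
// Hypothetical record of one provider call; AgentV's real types may differ.
interface LlmCallRecord {
  model: string;
  inputTokens?: number;
  outputTokens?: number;
}

type SpanAttributes = Record<string, string | number>;

// Map a call record to GenAI-convention span attributes.
// Token counts are omitted (not zeroed) when the provider did not report them.
function toGenAiAttributes(call: LlmCallRecord): SpanAttributes {
  const attrs: SpanAttributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": call.model,
  };
  if (call.inputTokens !== undefined) {
    attrs["gen_ai.usage.input_tokens"] = call.inputTokens;
  }
  if (call.outputTokens !== undefined) {
    attrs["gen_ai.usage.output_tokens"] = call.outputTokens;
  }
  return attrs;
}
```

Omitting absent counts rather than writing `0` keeps "provider reported nothing" distinguishable from "zero tokens used" in downstream backends.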

### Phase 2: Trace structure + file format (~1 week)
- [ ] #301 — W3C traceparent propagation for trace composition
- [ ] #302 — Opt-in turn-level span grouping (`--otel-group-turns`)
- [ ] #304 — OTLP JSON file export (`--otel-file`), remove `--trace`
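For #301, the W3C Trace Context `traceparent` header has the form `version-traceid-parentid-flags`. The sketch below shows one way to parse it so that an eval run could attach its spans under an external parent. It handles version `00` only, and `parseTraceparent` and `TraceParent` are hypothetical names. How AgentV would receive the header is not specified here.

```typescript
interface TraceParent {
  traceId: string; // 32 lowercase hex chars
  parentSpanId: string; // 16 lowercase hex chars
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers so callers can fall back to a new root trace.
function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version = "", traceId = "", parentSpanId = "", flags = ""] = m;
  if (version !== "00") return null; // this sketch only understands version 00
  // All-zero trace or parent ids are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Returning `null` instead of throwing means a bad header degrades to an unlinked trace rather than failing the eval.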

### Phase 3: Streaming + plugin (~2-3 weeks)
- [ ] #305 — Real-time span export during eval execution (streaming observability)
- [ ] #300 — Claude Code session tracing plugin (uses core exporter, lives outside core)

### Phase 4: Eval infrastructure optimization
- [ ] #306 — Lazy file-backed output for code judge payloads (large Message[] performance)

## Dependency Graph

```
#298 (GenAI conventions) ← gates the issues listed below
 ├── #299 (per-span tokens — uses GenAI attribute names)
 ├── #302 (turn grouping — spans use GenAI conventions)
 ├── #304 (OTLP JSON file — writes GenAI-convention spans to disk)
 ├── #305 (streaming export — creates GenAI-convention spans in real-time)
 └── #300 (session plugin — exports GenAI-convention spans via hooks)

#301 (trace composition — independent, can run in parallel with Phase 1)

#305 (streaming) ← #300 depends on this for real-time session tracing

#306 (lazy output) — independent, can be done anytime
```
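As a sanity check on the graph above, the dependencies can be encoded and sorted into layers. This is only an illustrative sketch: `dependsOn` is transcribed by hand from the graph, and `deliveryOrder` is a hypothetical helper, not part of AgentV.

```typescript
// Issue number -> issues it depends on, transcribed from the dependency graph.
const dependsOn: Record<number, number[]> = {
  298: [],
  299: [298],
  300: [298, 305],
  301: [],
  302: [298],
  304: [298],
  305: [298],
  306: [],
};

// Layered topological sort: each pass emits every issue whose dependencies are done.
function deliveryOrder(graph: Record<number, number[]>): number[] {
  const pending = Object.keys(graph).map(Number);
  const done: number[] = [];
  while (done.length < pending.length) {
    const ready = pending.filter(
      (issue) =>
        !done.includes(issue) &&
        (graph[issue] ?? []).every((dep) => done.includes(dep)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    done.push(...ready);
  }
  return done;
}
```

For the graph as written, the first layer is #298, #301, and #306, the second is #299, #302, #304, and #305, and #300 comes last because it waits on both #298 and #305.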

## Key Architectural Boundary

**Core** (`packages/core/src/observability/`): OTel exporter, attribute mapping, trace composition, span hierarchy, OTLP JSON file writer, streaming observer.

**Delete** (`apps/cli/src/commands/eval/trace-writer.ts`): Proprietary `TraceWriter`, `buildTraceRecord`, `extractTraceSpans` — replaced by #304.

**Plugin** (`plugins/agentv-trace/`): Claude Code hook wiring, session state management, transcript parsing. Opens the door for similar plugins for Copilot CLI, Codex, etc.

**SDK** (`packages/eval/`): Lazy file-backed loading for large payloads (#306) — transparent to judge authors.
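One possible shape for the lazy loading in #306, sketched under assumptions: the payload is assumed to be stored as JSON on disk, and the file reader is injected so the example stays self-contained (in practice it might wrap `fs.readFileSync`). `lazyJson` and `ReadFile` are hypothetical names, not the SDK's API.

```typescript
// Injected reader; assumed to return the file contents as a UTF-8 string.
type ReadFile = (path: string) => string;

// Returns a getter that reads and parses the file on first access only,
// then serves the cached value on every later call.
function lazyJson<T>(path: string, read: ReadFile): () => T {
  let loaded = false;
  let cached: T | undefined;
  return () => {
    if (!loaded) {
      cached = JSON.parse(read(path)) as T;
      loaded = true;
    }
    return cached as T;
  };
}
```

A judge that never touches the output pays no read or parse cost, while one that does sees an ordinary value, which is what "transparent to judge authors" would require.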
