Lazy file-backed output for code judge payloads

## Problem

Code judges receive the full `output` Message[] array via stdin for every invocation. For long-running agent sessions (Claude Code, Copilot), this array can be 1-10 MB (50+ turns with full file contents in tool outputs). With multiple code judges per test case and parallel workers, this creates redundant serialization:

- 3 workers × 10 MB output × 3 judges = **90 MB of redundant JSON** serialized, piped, and parsed

Most code judges don't even need the full output — they inspect `trace` (TraceSummary stats) or `answer` (string).

## Current Flow

```
orchestrator → JSON.stringify(full payload with output[]) → stdin pipe → judge process → JSON.parse
```

`code-evaluator.ts:41-57` always includes `output: context.output ?? null` in the stdin payload regardless of whether the judge uses it.

## Proposed: File-Backed Lazy Loading

Write large fields to a temp file once per test case. Pass the file path in the stdin payload. The `@agentv/eval` SDK reads the file transparently when the judge accesses the field.

### Orchestrator changes (`code-evaluator.ts`)

```typescript
// Write output to temp file once (shared across all judges for this test case)
const outputPath = await writeTempOutput(context.output);

const payload = {
  question: context.evalCase.question,
  answer: context.candidate,
  trace: context.trace ?? null,        // small — always in stdin
  output: null,                         // no longer in stdin
  _outputPath: outputPath,              // file path for lazy loading
  input: context.evalCase.input,
  // ... rest unchanged
};
```

### SDK changes (`@agentv/eval` runtime.ts)

```typescript
// Transparent lazy loading — judge code unchanged
const camelInput = toCamelCaseDeep(rawInput);

// If _outputPath present and output is null, create lazy getter
if (camelInput._outputPath && camelInput.output === null) {
  Object.defineProperty(camelInput, 'output', {
    get: () => {
      const data = JSON.parse(readFileSync(camelInput._outputPath, 'utf8'));
      // Cache after first read
      Object.defineProperty(camelInput, 'output', { value: data });
      return data;
    },
    configurable: true,
  });
}

const input = CodeJudgeInputSchema.parse(camelInput);
```

### Judge code — no changes needed

```typescript
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge((input) => {
  // These are always in stdin (fast)
  const { trace, answer } = input;
  
  // This triggers lazy file read only if accessed
  const output = input.output;  // transparent — reads from file
});
```

## What Changes

| Field | Current | Proposed |
|---|---|---|
| `answer`, `trace`, `criteria`, `config` | In stdin | In stdin (unchanged) |
| `output` (Message[]) | In stdin (always) | Temp file, lazy loaded via SDK |
| `input` (Message[]) | In stdin | In stdin (usually small) |
| Serialization cost per test case | N judges × full payload | 1 file write + N judges × small payload |
| SDK `input.output` API | Direct property | Same API (lazy getter, cached after first read) |
| Raw stdin judges (no SDK) | Get output in stdin | Get `_outputPath` + null output — must read file manually |

## Acceptance Criteria

- [ ] Orchestrator writes `output` Message[] to temp file once per test case (before running judges)
- [ ] Stdin payload sends `_outputPath` instead of full `output` array
- [ ] `@agentv/eval` SDK transparently reads from file when `input.output` is accessed
- [ ] Lazy getter caches result (file read happens at most once per judge invocation)
- [ ] Temp files cleaned up after all judges for a test case complete
- [ ] Backward compat: if `output` is present in stdin (non-null), use it directly (no file read)
- [ ] Raw stdin judges that don't use the SDK get `_outputPath` in payload — document migration path
- [ ] Add `input` to file-backed loading if payload size warrants it (future)

## Files to Modify

- `packages/core/src/evaluation/evaluators/code-evaluator.ts` — write temp file, pass path
- `packages/eval/src/runtime.ts` — lazy getter in `runCodeJudge`
- `packages/eval/src/schemas.ts` — add `_outputPath` to schema
- `packages/core/src/evaluation/orchestrator.ts` — manage temp file lifecycle (write before judges, cleanup after)

## Priority

**Low** — Current approach works fine for typical eval payloads. This optimization matters when evaluating long-running agent sessions with large tool outputs (50+ turns, MB-scale Message arrays).

## LLM Judge Note

Not affected — default LLM judge template doesn't include `{{output}}`. Custom templates that use `{{output}}` face a token cost problem (not serialization), which is a separate concern.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lazy file-backed output for code judge payloads #306

Problem

Current Flow

Proposed: File-Backed Lazy Loading

Orchestrator changes (`code-evaluator.ts`)

SDK changes (`@agentv/eval` runtime.ts)

Judge code — no changes needed

What Changes

Acceptance Criteria

Files to Modify

Priority

LLM Judge Note

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Field	Current	Proposed
`answer`, `trace`, `criteria`, `config`	In stdin	In stdin (unchanged)
`output` (Message[])	In stdin (always)	Temp file, lazy loaded via SDK
`input` (Message[])	In stdin	In stdin (usually small)
Serialization cost per test case	N judges × full payload	1 file write + N judges × small payload
SDK `input.output` API	Direct property	Same API (lazy getter, cached after first read)
Raw stdin judges (no SDK)	Get output in stdin	Get `_outputPath` + null output — must read file manually

Lazy file-backed output for code judge payloads #306

Description

Problem

Current Flow

Proposed: File-Backed Lazy Loading

Orchestrator changes (code-evaluator.ts)

SDK changes (@agentv/eval runtime.ts)

Judge code — no changes needed

What Changes

Acceptance Criteria

Files to Modify

Priority

LLM Judge Note

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Orchestrator changes (`code-evaluator.ts`)

SDK changes (`@agentv/eval` runtime.ts)