Lazy file-backed output for code judge payloads #306

@christso

Problem

Code judges receive the full output Message[] array via stdin for every invocation. For long-running agent sessions (Claude Code, Copilot), this array can be 1-10 MB (50+ turns with full file contents in tool outputs). With multiple code judges per test case and parallel workers, this creates redundant serialization:

  • 3 workers × 3 judges × 10 MB output = 90 MB of JSON serialized, piped, and parsed, where a single 10 MB copy per test case would suffice

Most code judges don't even need the full output — they inspect trace (TraceSummary stats) or answer (string).

Current Flow

orchestrator → JSON.stringify(full payload with output[]) → stdin pipe → judge process → JSON.parse

code-evaluator.ts:41-57 always includes output: context.output ?? null in the stdin payload regardless of whether the judge uses it.

Proposed: File-Backed Lazy Loading

Write large fields to a temp file once per test case. Pass the file path in the stdin payload. The @agentv/eval SDK reads the file transparently when the judge accesses the field.

Orchestrator changes (code-evaluator.ts)

// Write output to temp file once (shared across all judges for this test case)
const outputPath = await writeTempOutput(context.output);

const payload = {
  question: context.evalCase.question,
  answer: context.candidate,
  trace: context.trace ?? null,        // small — always in stdin
  output: null,                         // no longer in stdin
  _outputPath: outputPath,              // file path for lazy loading
  input: context.evalCase.input,
  // ... rest unchanged
};
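The snippet above calls writeTempOutput without defining it. A minimal sketch of what it might look like follows; the directory prefix, the file name, and the choice to serialize a missing output as null are all assumptions, not part of the proposal.

```typescript
import { mkdtemp, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical implementation: the issue names writeTempOutput but does not define it.
async function writeTempOutput(output: unknown): Promise<string> {
  // mkdtemp creates a uniquely named directory, so parallel workers cannot collide.
  const dir = await mkdtemp(join(tmpdir(), "agentv-output-"));
  const file = join(dir, "output.json");
  // Serialize once; every judge for this test case receives the same path.
  await writeFile(file, JSON.stringify(output ?? null), "utf8");
  return file;
}
```

Returning a path inside a dedicated directory means cleanup can remove the directory as a whole once the judges finish.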

SDK changes (@agentv/eval runtime.ts)

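import { readFileSync } from 'node:fs';

// Transparent lazy loading: judge code unchanged
const camelInput = toCamelCaseDeep(rawInput);

// If _outputPath is present and output is null, install a lazy getter
if (camelInput._outputPath && camelInput.output === null) {
  Object.defineProperty(camelInput, 'output', {
    get: () => {
      const data = JSON.parse(readFileSync(camelInput._outputPath, 'utf8'));
      // Cache after first read: replace the getter with a plain value
      Object.defineProperty(camelInput, 'output', { value: data, enumerable: true });
      return data;
    },
    enumerable: true,
    configurable: true,
  });
}

// Caveat: if parse() reads `output` while validating, the getter fires here
// and the file is loaded eagerly; the getter may need to be installed afterwards.
const input = CodeJudgeInputSchema.parse(camelInput);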
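One detail worth checking before implementing: schema validators commonly read every declared property and return a fresh object, in which case a getter installed before `parse` would run during validation and the laziness would be lost. The sketch below installs the getter on the already-validated object instead. `attachLazyOutput` is a hypothetical helper name, and whether `CodeJudgeInputSchema` behaves this way is an assumption to verify against the real schema.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: installs a cached lazy `output` getter on a validated input.
function attachLazyOutput<T extends { output?: unknown }>(input: T, outputPath: string): T {
  Object.defineProperty(input, "output", {
    get() {
      const data: unknown = JSON.parse(readFileSync(outputPath, "utf8"));
      // Swap the accessor for a plain value so later reads never touch the file.
      Object.defineProperty(input, "output", { value: data, enumerable: true });
      return data;
    },
    enumerable: true,
    configurable: true, // required so the getter can replace itself above
  });
  return input;
}
```

With this shape, a judge that only inspects `trace` or `answer` never opens the file, and a judge that reads `input.output` twice pays for one read and one parse.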

Judge code — no changes needed

import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge((input) => {
  // These are always in stdin (fast)
  const { trace, answer } = input;
  
  // This triggers lazy file read only if accessed
  const output = input.output;  // transparent — reads from file
});

What Changes

| Field | Current | Proposed |
| --- | --- | --- |
| answer, trace, criteria, config | In stdin | In stdin (unchanged) |
| output (Message[]) | In stdin (always) | Temp file, lazy loaded via SDK |
| input (Message[]) | In stdin | In stdin (usually small) |
| Serialization cost per test case | N judges × full payload | 1 file write + N judges × small payload |
| SDK input.output API | Direct property | Same API (lazy getter, cached after first read) |
| Raw stdin judges (no SDK) | Get output in stdin | Get _outputPath + null output; must read file manually |
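The serialization-cost row can be made concrete with the figures from the Problem section. This is a back-of-the-envelope sketch only: it ignores the small stdin payload and the cost of any lazy reads by judges that do access output, and `RunShape` and `serializedOutputMB` are illustrative names.

```typescript
// Illustrative cost model; the numbers are the ones quoted in the issue.
interface RunShape {
  workers: number;        // test cases being evaluated in parallel
  judgesPerCase: number;  // code judges attached to each test case
  outputMB: number;       // serialized size of the output Message[]
}

// Megabytes of output the orchestrator serializes across one round of test cases.
function serializedOutputMB(shape: RunShape, fileBacked: boolean): number {
  // Current flow: every judge gets its own copy through stdin.
  // Proposed flow: one temp file per test case, shared by all its judges.
  const copiesPerCase = fileBacked ? 1 : shape.judgesPerCase;
  return shape.workers * copiesPerCase * shape.outputMB;
}

const shape: RunShape = { workers: 3, judgesPerCase: 3, outputMB: 10 };
console.log(serializedOutputMB(shape, false)); // 90
console.log(serializedOutputMB(shape, true));  // 30
```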

Acceptance Criteria

  • Orchestrator writes output Message[] to temp file once per test case (before running judges)
  • Stdin payload sends _outputPath instead of full output array
  • @agentv/eval SDK transparently reads from file when input.output is accessed
  • Lazy getter caches result (file read happens at most once per judge invocation)
  • Temp files cleaned up after all judges for a test case complete
  • Backward compat: if output is present in stdin (non-null), use it directly (no file read)
  • Raw stdin judges that don't use the SDK get _outputPath in payload — document migration path
  • Add input to file-backed loading if payload size warrants it (future)
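For the backward-compatibility and raw-stdin criteria, a judge that does not use the SDK could resolve output with a small helper like the one below. This is a sketch under assumptions: `resolveOutput` is a hypothetical name, and the key is written as `_outputPath` as in the proposal, although the actual casing on the wire may differ.

```typescript
import { readFileSync } from "node:fs";

// Only the fields this sketch needs; the wire key name is an assumption.
interface RawJudgePayload {
  output?: unknown[] | null;
  _outputPath?: string | null;
}

// Hypothetical migration helper for judges that parse stdin themselves.
function resolveOutput(payload: RawJudgePayload): unknown[] | null {
  // Backward compat: an inline, non-null output wins and no file is read.
  if (payload.output != null) return payload.output;
  // New flow: output is null and the path points at the shared temp file.
  if (payload._outputPath) {
    return JSON.parse(readFileSync(payload._outputPath, "utf8"));
  }
  return null;
}
```

A raw judge would call this once after parsing stdin, for example `resolveOutput(JSON.parse(readFileSync(0, "utf8")))`, and skip the call entirely if it only needs trace or answer.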

Files to Modify

  • packages/core/src/evaluation/evaluators/code-evaluator.ts — write temp file, pass path
  • packages/eval/src/runtime.ts — lazy getter in runCodeJudge
  • packages/eval/src/schemas.ts — add _outputPath to schema
  • packages/core/src/evaluation/orchestrator.ts — manage temp file lifecycle (write before judges, cleanup after)
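The lifecycle item for orchestrator.ts (write before judges, clean up after) could be expressed as a wrapper with a try/finally, so the temp file is removed even when a judge throws. The helper name `withTempOutput` and the callback shape are assumptions for illustration, not the existing orchestrator API.

```typescript
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: runs all judges for one test case against a shared temp file.
async function withTempOutput<T>(
  output: unknown,
  runJudges: (outputPath: string) => Promise<T>,
): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), "agentv-output-"));
  const outputPath = join(dir, "output.json");
  try {
    await writeFile(outputPath, JSON.stringify(output ?? null), "utf8");
    // Judges may run concurrently inside the callback; they all share outputPath.
    return await runJudges(outputPath);
  } finally {
    // Runs on success and on failure, so a crashing judge cannot leak the file.
    await rm(dir, { recursive: true, force: true });
  }
}
```

Because cleanup happens only after the callback settles, the callback should await every judge (for instance with `Promise.all`) before returning.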

Priority

Low — Current approach works fine for typical eval payloads. This optimization matters when evaluating long-running agent sessions with large tool outputs (50+ turns, MB-scale Message arrays).

LLM Judge Note

Not affected — default LLM judge template doesn't include {{output}}. Custom templates that use {{output}} face a token cost problem (not serialization), which is a separate concern.

Labels: enhancement