PRD: Codex deep baseline and evidence-scoped capability reports #10

@Jordak

Problem Statement

Agent Eval Lab has grown from a starter harness into the early shape of a credible coding-agent evaluation lab. It can run tasks, invoke Codex CLI, capture outcomes, verify reference artifacts, summarize repeated trials, and record human reviews. However, the project still needs a coherent next-phase product plan so work does not scatter across task curation, agent integration, reporting, environment setup, and reliability metrics.

From the user's perspective, the core problem is: "I want to understand, with evidence, what coding-agent harnesses do well, poorly, or inconsistently on realistic software engineering tasks, starting with Codex CLI. I want the output to be useful to technical evaluators and to solo developers deciding which AI coding tool fits their use case. I do not want the lab to overgeneralize beyond observed evidence."

The current state has also surfaced important operational lessons:

  • New task or grader behavior must be smoke-tested with one trial and one job before scaling to repeated or parallel trials.
  • Environment failures should be diagnosed as environment failures, not hidden behind app-level workarounds.
  • Task-local environments must make likely self-check tools available to both the agent harness and deterministic graders.
  • Reports must distinguish deterministic grader outcomes from human review outcomes.
  • Repeated-trial summaries must avoid mixing invalid setup/dependency failures into fair capability claims.

Solution

Build the next phase around a Codex deep baseline for the Solo Dev Starter Suite.

The lab should curate realistic task bundles, verify each task with a reference artifact, smoke-test each task with one Codex trial, then run bounded repeated Codex trials only after the single-trial path is fair. The output should be an evidence-scoped Markdown agent capability report that explains what Codex CLI did well, poorly, and inconsistently under the evaluated conditions.

The solution should continue to follow Anthropic-aligned terminology: tasks, trials, evaluation suites, agent harnesses, graders/assertions, traces/transcripts, outcomes, reference verification, and pass@k/pass^k. It should keep "agent harness" separate from "underlying model" and avoid global claims such as "model Y has capability Z" unless the evidence supports that scope.

The project should also strengthen the harness around invalid/excluded trials, task-local environments, outcome evidence, and report generation so future task batches are more trustworthy and easier to interpret.
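
The smoke-test-first workflow described in this section can be sketched as a small gate in front of the repeated-trial loop. This is a minimal illustration, not the lab's actual runner: `run_batch`, the `run_trial` callable, and the outcome strings `passed`, `failed`, and `invalid` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical outcome string; the real result schema may differ.
INVALID = "invalid"


def run_batch(run_trial: Callable[[int], str], trials: int, jobs: int) -> List[str]:
    """Run one smoke trial alone, then the remaining trials with bounded concurrency."""
    if trials < 1 or jobs < 1:
        raise ValueError("trials and jobs must both be at least 1")

    # Smoke test: one trial, one job.
    first = run_trial(0)
    if first == INVALID or trials == 1:
        # An invalid smoke trial points at setup or environment trouble,
        # so stop before spending repeated trials on an unfair task.
        return [first]

    # The single-trial path looked fair; run the rest with at most `jobs` workers.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        rest = list(pool.map(run_trial, range(1, trials)))
    return [first] + rest
```

In this sketch a failed but valid smoke trial does not stop the batch, because an agent failing a fair task is capability evidence rather than a setup problem.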

User Stories

  1. As a technical evaluator, I want each coding task to pin a real repository and commit, so that trial results are reproducible.
  2. As a technical evaluator, I want each task to live in a task bundle, so that metadata, generated task cards, and reference artifacts stay together.
  3. As a task curator, I want task cards generated from task metadata, so that human-readable docs cannot drift silently from the source of truth.
  4. As a task curator, I want suite indexes generated from task bundles, so that humans and AI assistants can quickly scan the available tasks.
  5. As a task curator, I want a pre-commit check for task-card drift, so that generated review artifacts stay aligned before publishing.
  6. As a task curator, I want each publishable task to include a verified reference artifact, so that the task is proven solvable before agents are evaluated.
  7. As a task curator, I want reference verification to produce the same kind of report/result artifact as agent trials, so that reference outcomes can be inspected consistently.
  8. As a task curator, I want reference verification to enforce success criteria such as max files changed, so that reference artifacts remain focused.
  9. As a task curator, I want task-local setup commands, so that each task can provision its own dependencies.
  10. As a task curator, I want task-local environment variables and PATH entries, so that agents and graders use the same task tools.
  11. As a task curator, I want task-local environments to expose obvious self-check tools like pytest, so that agent harnesses can run natural validation commands.
  12. As a maintainer, I want new task environments smoke-tested with one trial and one job, so that unfair setup problems are caught before spending repeated trials.
  13. As a maintainer, I want parallel trial execution after smoke tests pass, so that repeated trials do not take unnecessarily long.
  14. As a maintainer, I want successful multi-trial batches to print only aggregate summaries, so that terminal output remains readable.
  15. As a maintainer, I want failed multi-trial batches to print failed trial IDs and report paths, so that I can quickly inspect the relevant artifacts.
  16. As a maintainer, I want agent launch errors printed to stderr, so that I do not need to open transcripts to discover basic adapter failures.
  17. As a maintainer, I want reusable terminal/error behavior in shared layers, so that adapter-specific classes stay focused.
  18. As a maintainer, I want environment problems fixed at the environment layer, so that portable harness behavior does not become a pile of machine-specific workarounds.
  19. As a maintainer, I want agent executable discovery to rely on PATH or explicit configuration, so that the lab remains portable.
  20. As a maintainer, I want no hard-coded user-machine paths, so that other contributors can run the lab.
  21. As a maintainer, I want task run IDs to be collision-resistant, so that concurrent trials cannot overwrite each other.
  22. As a maintainer, I want manual trials, reference verification, and Codex trials to share the same grader semantics, so that outcomes are comparable.
  23. As a maintainer, I want manual trials to remain available as positive and negative controls, so that new harness behavior can be validated before agent integrations.
  24. As an evaluator, I want Codex CLI invoked non-interactively as an agent harness adapter, so that Codex can be evaluated reproducibly.
  25. As an evaluator, I want Codex CLI traces, final messages, diffs, reports, and result metadata captured, so that I can inspect both trajectory and outcome.
  26. As an evaluator, I want the report to show changed files and deterministic assertions, so that I can understand exactly why a trial passed or failed.
  27. As an evaluator, I want result metadata to include trial kind, suite, eval type, task, agent harness, model when known, checks, outcome, duration, changed files, and errors, so that later summaries can be computed.
  28. As an evaluator, I want repeated trial summaries grouped by suite, eval type, task, agent harness, and model, so that aggregation compares like with like.
  29. As an evaluator, I want summaries to report pass rate, pass@k, and pass^k, so that I can distinguish "eventually succeeds" from "succeeds consistently."
  30. As an evaluator, I want invalid setup/dependency trials excluded or clearly labeled, so that pass rates do not misrepresent agent capability.
  31. As an evaluator, I want human review labels separate from deterministic grader status, so that code-passing but messy or risky patches can be recorded accurately.
  32. As an evaluator, I want review labels such as success_clean, success_messy, test_gap, over_edit, tool_misuse, dependency_issue, and resource_inefficient, so that failure modes are comparable across trials.
  33. As an evaluator, I want human review evidence attached to labels, so that review judgments are auditable.
  34. As an evaluator, I want edit size metrics such as files changed and, eventually, lines added/deleted, so that over-editing can be detected.
  35. As an evaluator, I want resource usage metrics such as duration, token usage, and cost when available, so that resource-inefficient behavior can be identified.
  36. As a model-quality engineer, I want reports to scope claims to evaluated conditions, so that the lab does not overclaim global model capability.
  37. As a model-quality engineer, I want transcript review available when deterministic graders fail, so that I can decide whether the grader rejected a valid solution.
  38. As a model-quality engineer, I want task prompts to be realistic but grader expectations unambiguous, so that failures are meaningful rather than prompt-noise artifacts.
  39. As a solo developer, I want reports to explain what an agent harness does well, poorly, and inconsistently, so that I can choose a tool that fits my project.
  40. As a solo developer, I want the report to be Markdown and AI-readable, so that I can ask another AI assistant to help interpret the evidence for my situation.
  41. As a solo developer, I want recommendations scoped to evidence, so that I can act on them without mistaking them for universal rankings.
  42. As a solo developer, I want tasks to reflect realistic maintenance work, so that the evaluation feels relevant to real projects.
  43. As a solo developer, I want at least one UI or visual task eventually included in the starter suite, so that visual/UX agent behavior is represented.
  44. As a future evaluator, I want the architecture to support interactive tasks later, so that tasks can allow follow-up questions when that behavior is part of the evaluation.
  45. As a future evaluator, I want follow-up-question quality to be gradable later, so that interactive agent behavior can be measured without redesigning the whole task schema.
  46. As a maintainer, I want GitHub Issues to hold PRDs and implementation tickets, so that project work is visible and ready for agents.
  47. As a maintainer, I want ready-for-agent issues to be well specified, so that an AFK coding agent can pick them up safely.
  48. As a maintainer, I want architectural decisions recorded as ADRs, so that later contributors understand why the harness behaves the way it does.
  49. As a maintainer, I want project-private career context kept out of public docs, so that public artifacts remain about the product and evaluation evidence.
  50. As a maintainer, I want all public docs to use the ubiquitous language consistently, so that agents and humans interpret the project the same way.
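
The distinction in story 29 between "eventually succeeds" and "succeeds consistently" can be made concrete with a short sketch. It assumes the commonly used combinatorial estimators: given `n` fair trials with `c` passes, pass@k is the chance that at least one of `k` sampled trials passes, and pass^k is the chance that all `k` pass. The function names are hypothetical, and the lab's own summary code may compute these differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts: n trials, c passes, sample size k."""
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n, and 1 <= k <= n")


def pass_rate(n: int, c: int) -> float:
    """Fraction of trials that passed."""
    _check(n, c, 1)
    return c / n


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled trials passes."""
    _check(n, c, k)
    # 1 minus the chance that all k sampled trials come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k sampled trials pass (pass^k)."""
    _check(n, c, k)
    # Chance that all k sampled trials come from the c passes.
    return comb(c, k) / comb(n, k)
```

For example, with 5 fair trials and 3 passes, the pass rate is 0.6, pass@2 is 0.9, and pass^2 is 0.3: the same trials look strong on "eventually succeeds" and weak on "succeeds consistently".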

Implementation Decisions

  • The project will use "agent harness" as the comparison target term and will not use "agent tool" or "agent application" interchangeably with it. The underlying model is a separate dimension.
  • The first serious evaluation effort is the Codex deep baseline, not an immediate multi-harness comparison. Codex CLI is tested deeply first because it is installed and available, while other harnesses such as Claude Code or Cursor can be integrated later.
  • The first suite is the Solo Dev Starter Suite: realistic, small maintenance tasks relevant to solo developers and technical evaluators.
  • The suite should be mixed-language over time, but the initial concrete tasks are Python/Click regressions plus the existing 2048 capability task.
  • Each task is represented as a task bundle with a source task definition, generated task card, and optional reference artifacts.
  • Task definitions are the source of truth; task cards and suite indexes are generated review artifacts.
  • Reference artifacts can be patches or commits, but publishable tasks should have verified reference artifacts rather than prose-only reference notes.
  • Reference verification uses the same report/result shape as agent trials and is marked as reference verification.
  • The deterministic grader outcome remains separate from the human review outcome.
  • Code-based graders are the default for coding correctness. Transcript/tool-call graders should be added only when they measure behavior outcome graders cannot capture.
  • The runner applies task-local environment configuration to setup, baseline, agent, and target grader commands.
  • Task-local environments currently support workspace-relative PATH prepends and environment variables with a workspace placeholder.
  • Python tasks can create a local virtual environment during setup and expose its tools to both agent and graders.
  • For Click tasks, the environment must expose the source checkout explicitly to avoid import failures caused by editable-install path handling in paths containing spaces.
  • New task, grader, environment, or agent-harness behavior must be tested with one trial and one job before repeated or parallel trials.
  • Once a single-trial smoke test passes, repeated trials can use bounded concurrency through the jobs option.
  • Successful multi-trial batches should print compact aggregate summaries. Failed batches should identify failed trials and report/result paths.
  • Trial directories should be unique even for concurrent trials started in the same second.
  • Codex CLI should be discovered through PATH or configured explicitly. The lab should not guess machine-specific install paths.
  • Agent launch errors should surface in stderr through shared summary/error behavior, not only in transcripts.
  • Environment failures should be fixed in environment/task configuration, not worked around in child adapters.
  • Reusable behavior belongs in shared modules; child adapters should remain adapter-specific.
  • Human review labels include resource_inefficient for disproportionate runtime, token usage, cost, or command churn. An expensive trial is not automatically a bad one.
  • Token and cost tracking remain open work until the Codex event stream or another source exposes reliable metrics.
  • Reports must use evidence-based language: "under these conditions, this harness performed this way", not global model capability claims.
  • The first capability reports should be hand-authored interpretations in Markdown, even if drafted with AI assistance.
  • Reports should be readable by humans and by AI assistants helping a solo developer decide what tool fits their project.
  • GitHub Issues are the project issue tracker for PRDs and implementation work.
  • Architectural decisions should continue to live in ADRs; design docs summarize the current system shape.
  • Private career-planning context does not belong in public project docs.
  • The current task candidate backlog includes promoted Click tasks plus future candidates from Prettier, Vite, HTTPX, Express, and Remotion.
  • The current invalid default-map trial batch should be treated as dependency_issue evidence, not as fair Codex capability evidence.
  • The system needs an explicit invalid/excluded trial state so summaries can exclude unfair setup/dependency failures rather than relying only on human review labels.
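
The task-local environment decisions above (workspace-relative PATH prepends, environment variables with a workspace placeholder, and one environment shared by setup, baseline, agent, and grader commands) can be illustrated with a minimal sketch. The placeholder syntax `{workspace}` and the function name `build_task_env` are assumptions made for this example; the PRD does not specify either.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical placeholder syntax; the real task schema may use another token.
WORKSPACE_PLACEHOLDER = "{workspace}"


def build_task_env(
    workspace: Path,
    path_prepend: Sequence[str],
    env_vars: Mapping[str, str],
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build one environment dict to reuse for every command in a trial."""
    ws = Path(workspace).resolve()
    env = dict(os.environ if base_env is None else base_env)

    # Expand the workspace placeholder in task-defined variables.
    for name, value in env_vars.items():
        env[name] = value.replace(WORKSPACE_PLACEHOLDER, str(ws))

    # Resolve workspace-relative PATH entries and reject anything outside it.
    prepended = []
    for rel in path_prepend:
        candidate = (ws / rel).resolve()
        try:
            candidate.relative_to(ws)
        except ValueError:
            raise ValueError(f"PATH entry escapes the workspace: {rel}") from None
        prepended.append(str(candidate))

    existing = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(prepended + ([existing] if existing else []))
    return env
```

Passing the same returned dict as `env=` to each `subprocess.run` call is one way to make sure the agent and the graders resolve tools such as pytest from the same task-local location.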

Testing Decisions

  • Good tests should exercise external behavior: CLI output, result/report artifacts, task validation, grader command outcomes, reference verification outcomes, and summary rows.
  • Tests should avoid overfitting to implementation details such as private helper internals unless the helper exists specifically to enforce an externally important invariant.
  • Task loading tests should cover task bundle discovery, required fields, eval type validation, failure-mode validation, reference artifact validation, task-local environment fields, and rejection of paths outside the workspace.
  • Environment tests should cover PATH prepending, workspace placeholder expansion, and application of the same environment to setup, baseline, agent, and target commands.
  • Runner tests should cover isolated workspace preparation, manual no-pause trials, max files changed enforcement, task environment usage by graders, and collision-resistant trial IDs.
  • Codex adapter tests should cover command construction, PATH-based executable resolution, missing executable errors, capture of events/final messages, task environment propagation, and progress behavior.
  • CLI tests should cover jobs parsing, single-trial-first behavior where enforceable, quiet passing summaries, failed-trial detail output, and parallel progress behavior.
  • Reference verification tests should cover patch and commit artifacts, artifact failure handling, report/result writing, max files changed notes, and task-local environment usage.
  • Summary tests should cover pass rate, pass@k, pass^k, grouping dimensions, human review label counts, and, once implemented, exclusion of invalid/dependency trials.
  • Task-card generator tests or checks should continue to ensure generated task cards and suite indexes match source metadata.
  • Pre-commit checks should continue to run task-card drift checks and task validation.
  • New task promotion should include reference verification before agent trials.
  • New task or environment changes should run one Codex trial/job before repeated trials.
  • Repeated trials should only be interpreted after confirming the single-trial path is fair.
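
As an example of a test that exercises an externally important invariant, the collision-resistant trial ID requirement could be checked along these lines. The ID format (task ID, UTC timestamp, random suffix) and the helper name `new_trial_id` are hypothetical; the sketch only shows that a random suffix keeps IDs distinct when many trials start in the same second.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def new_trial_id(task_id: str, now: Optional[datetime] = None) -> str:
    """Return a trial ID that stays unique even within the same second."""
    moment = now if now is not None else datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    # The random suffix, not the timestamp, provides the collision resistance.
    return f"{task_id}-{stamp}-{uuid.uuid4().hex[:12]}"


def test_trial_ids_are_unique_within_the_same_second() -> None:
    # Freeze the clock so every ID shares one timestamp.
    frozen = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    ids = [new_trial_id("example-task", now=frozen) for _ in range(1000)]
    assert len(set(ids)) == len(ids)
    assert all(i.startswith("example-task-20250101T120000Z-") for i in ids)
```

Creating the trial directory with `Path.mkdir(exist_ok=False)` would add a second safeguard, since a collision would then raise instead of silently overwriting artifacts.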

Out of Scope

  • Comparing multiple agent harnesses in the immediate next slice. Codex deep baseline comes first.
  • Claiming universal model capability from local Codex CLI results.
  • Treating private career/recruiter goals as public project documentation.
  • Building a full web dashboard before Markdown capability reports are useful.
  • Adding model-based graders before deterministic graders and human review have enough calibration.
  • Implementing full token/cost accounting until reliable data is available from agent traces or event streams.
  • Making interactive clarification tasks part of the first non-interactive starter suite.
  • Depending on fixture repositories for the main credibility of the suite, except where a capability is difficult to capture naturally.
  • Running large parallel trial batches for new tasks before a one-trial smoke test passes.

Further Notes

  • Current implemented capabilities include task validation, task bundles, generated task cards, reference verification, manual adapter trials, Codex CLI adapter trials, deterministic graders, diff/report/result artifacts, human review labels, repeated-trial execution, concurrent jobs, pass@k/pass^k summaries, and task-local environments.
  • Two Click regression tasks have passed one fair Codex trial each after the task environment was corrected.
  • An earlier five-trial default-map batch failed because the task environment omitted source import configuration; those trials should be considered dependency_issue runs and excluded from capability interpretation.
  • The next high-leverage implementation is explicit invalid/excluded trial handling so summaries can separate fair capability trials from harness or environment failures.
  • After invalid/excluded trial handling, the next product milestone is a small repeated fair Codex batch across the verified tasks, followed by a hand-authored evidence-scoped capability report.
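
One possible shape for the invalid/excluded trial handling described above is sketched below. All field names, the `invalid` outcome value, and the `summarize` function are assumptions for illustration, not the lab's current result schema. The point is that invalid trials are counted and reported but never enter the pass-rate denominator.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical outcome values; "invalid" marks unfair setup/dependency failures.
VALID_OUTCOMES = {"passed", "failed", "invalid"}


@dataclass(frozen=True)
class TrialResult:
    suite: str
    eval_type: str
    task: str
    agent_harness: str
    model: Optional[str]
    outcome: str
    invalid_reason: Optional[str] = None  # e.g. "dependency_issue"


def summarize(results: List[TrialResult]) -> List[Dict[str, object]]:
    """Group trials like with like and keep invalid trials out of pass rates."""
    groups: Dict[Tuple, Counter] = defaultdict(Counter)
    for r in results:
        if r.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {r.outcome}")
        key = (r.suite, r.eval_type, r.task, r.agent_harness, r.model)
        groups[key][r.outcome] += 1

    rows = []
    for key, counts in groups.items():
        fair = counts["passed"] + counts["failed"]
        rows.append({
            "group": key,
            "fair_trials": fair,
            "excluded_trials": counts["invalid"],
            # No fair trials means no capability claim, so report None.
            "pass_rate": counts["passed"] / fair if fair else None,
        })
    return rows
```

With this shape, a batch that failed only because of a broken task environment shows up as excluded trials with no pass rate, rather than as a 0% result attributed to the agent harness.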
