Problem Statement
Agent Eval Lab has grown from a starter harness into the early shape of a credible coding-agent evaluation lab. It can run tasks, invoke Codex CLI, capture outcomes, verify reference artifacts, summarize repeated trials, and record human reviews. However, the project still needs a coherent next-phase product plan so work does not scatter across task curation, agent integration, reporting, environment setup, and reliability metrics.
From the user's perspective, the core problem is: "I want to understand, with evidence, what coding-agent harnesses do well, poorly, or inconsistently on realistic software engineering tasks, starting with Codex CLI. I want the output to be useful to technical evaluators and to solo developers deciding which AI coding tool fits their use case. I do not want the lab to overgeneralize beyond observed evidence."
The current state has also surfaced important operational lessons:
- New task or grader behavior must be smoke-tested with one trial and one job before scaling to repeated or parallel trials.
- Environment failures should be diagnosed as environment failures, not hidden behind app-level workarounds.
- Task-local environments must make likely self-check tools available to both the agent harness and deterministic graders.
- Reports must distinguish deterministic grader outcomes from human review outcomes.
- Repeated-trial summaries must avoid mixing invalid setup/dependency failures into fair capability claims.
Solution
Build the next phase around a Codex deep baseline for the Solo Dev Starter Suite.
The lab should curate realistic task bundles, verify each task with a reference artifact, smoke-test each task with one Codex trial, then run bounded repeated Codex trials only after the single-trial path is fair. The output should be an evidence-scoped Markdown agent capability report that explains what Codex CLI did well, poorly, and inconsistently under the evaluated conditions.
The solution should continue to follow Anthropic-aligned terminology: tasks, trials, evaluation suites, agent harnesses, graders/assertions, traces/transcripts, outcomes, reference verification, and pass@k/pass^k. It should keep "agent harness" separate from "underlying model" and avoid global claims such as "model Y has capability Z" unless the evidence supports that scope.
The project should also strengthen the harness around invalid/excluded trials, task-local environments, outcome evidence, and report generation so future task batches are more trustworthy and easier to interpret.
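The ordering described above (reference verification, then one smoke trial, then bounded repeated trials) can be expressed as a small gate. The following is a minimal sketch rather than the lab's actual runner logic; `TaskReadiness`, `plan_batch`, and their fields are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskReadiness:
    """Hypothetical record of which gates a task has already cleared."""

    reference_verified: bool
    smoke_trial_fair: bool


def plan_batch(readiness: TaskReadiness, trials: int, jobs: int) -> tuple[int, int]:
    """Return the (trials, jobs) pair that is safe to run for this task now."""
    if trials < 1 or jobs < 1:
        raise ValueError("trials and jobs must be positive")
    if not readiness.reference_verified:
        # A task is not proven solvable until its reference artifact verifies.
        raise RuntimeError("verify the reference artifact before agent trials")
    if not readiness.smoke_trial_fair:
        # No fair single-trial run yet: force one trial and one job.
        return 1, 1
    # Bounded concurrency: never start more jobs than there are trials.
    return trials, min(jobs, trials)
```

Under this sketch, a request for five trials on a task that has not yet passed a fair smoke test is reduced to a single trial with a single job.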
User Stories
- As a technical evaluator, I want each coding task to pin a real repository and commit, so that trial results are reproducible.
- As a technical evaluator, I want each task to live in a task bundle, so that metadata, generated task cards, and reference artifacts stay together.
- As a task curator, I want task cards generated from task metadata, so that human-readable docs cannot drift silently from the source of truth.
- As a task curator, I want suite indexes generated from task bundles, so that humans and AI assistants can quickly scan the available tasks.
- As a task curator, I want a pre-commit check for task-card drift, so that generated review artifacts stay aligned before publishing.
- As a task curator, I want each publishable task to include a verified reference artifact, so that the task is proven solvable before agents are evaluated.
- As a task curator, I want reference verification to produce the same kind of report/result artifact as agent trials, so that reference outcomes can be inspected consistently.
- As a task curator, I want reference verification to enforce success criteria such as max files changed, so that reference artifacts remain focused.
- As a task curator, I want task-local setup commands, so that each task can provision its own dependencies.
- As a task curator, I want task-local environment variables and PATH entries, so that agents and graders use the same task tools.
- As a task curator, I want task-local environments to expose obvious self-check tools like pytest, so that agent harnesses can run natural validation commands.
- As a maintainer, I want new task environments smoke-tested with one trial and one job, so that unfair setup problems are caught before spending repeated trials.
- As a maintainer, I want parallel trial execution after smoke tests pass, so that repeated trials do not take unnecessarily long.
- As a maintainer, I want successful multi-trial batches to print only aggregate summaries, so that terminal output remains readable.
- As a maintainer, I want failed multi-trial batches to print failed trial IDs and report paths, so that I can quickly inspect the relevant artifacts.
- As a maintainer, I want agent launch errors printed to stderr, so that I do not need to open transcripts to discover basic adapter failures.
- As a maintainer, I want reusable terminal/error behavior in shared layers, so that adapter-specific classes stay focused.
- As a maintainer, I want environment problems fixed at the environment layer, so that portable harness behavior does not become a pile of machine-specific workarounds.
- As a maintainer, I want agent executable discovery to rely on PATH or explicit configuration, so that the lab remains portable.
- As a maintainer, I want no hard-coded user-machine paths, so that other contributors can run the lab.
- As a maintainer, I want task run IDs to be collision-resistant, so that concurrent trials cannot overwrite each other.
- As a maintainer, I want manual trials, reference verification, and Codex trials to share the same grader semantics, so that outcomes are comparable.
- As a maintainer, I want manual trials to remain available as positive and negative controls, so that new harness behavior can be validated before agent integrations.
- As an evaluator, I want Codex CLI invoked non-interactively as an agent harness adapter, so that Codex can be evaluated reproducibly.
- As an evaluator, I want Codex CLI traces, final messages, diffs, reports, and result metadata captured, so that I can inspect both trajectory and outcome.
- As an evaluator, I want the report to show changed files and deterministic assertions, so that I can understand exactly why a trial passed or failed.
- As an evaluator, I want result metadata to include trial kind, suite, eval type, task, agent harness, model when known, checks, outcome, duration, changed files, and errors, so that later summaries can be computed.
- As an evaluator, I want repeated trial summaries grouped by suite, eval type, task, agent harness, and model, so that aggregation compares like with like.
- As an evaluator, I want summaries to report pass rate, pass@k, and pass^k, so that I can distinguish "eventually succeeds" from "succeeds consistently."
- As an evaluator, I want invalid setup/dependency trials excluded or clearly labeled, so that pass rates do not misrepresent agent capability.
- As an evaluator, I want human review labels separate from deterministic grader status, so that code-passing but messy or risky patches can be recorded accurately.
- As an evaluator, I want review labels such as success_clean, success_messy, test_gap, over_edit, tool_misuse, dependency_issue, and resource_inefficient, so that failure modes are comparable across trials.
- As an evaluator, I want human review evidence attached to labels, so that review judgments are auditable.
- As an evaluator, I want edit size metrics such as files changed and, eventually, lines added/deleted, so that over-editing can be detected.
- As an evaluator, I want resource usage metrics such as duration, token usage, and cost when available, so that resource-inefficient behavior can be identified.
- As a model-quality engineer, I want reports to scope claims to evaluated conditions, so that the lab does not overclaim global model capability.
- As a model-quality engineer, I want transcript review available when deterministic graders fail, so that I can decide whether the grader rejected a valid solution.
- As a model-quality engineer, I want task prompts to be realistic but grader expectations unambiguous, so that failures are meaningful rather than prompt-noise artifacts.
- As a solo developer, I want reports to explain what an agent harness does well, poorly, and inconsistently, so that I can choose a tool that fits my project.
- As a solo developer, I want the report to be Markdown and AI-readable, so that I can ask another AI assistant to help interpret the evidence for my situation.
- As a solo developer, I want recommendations scoped to evidence, so that I can act on them without mistaking them for universal rankings.
- As a solo developer, I want tasks to reflect realistic maintenance work, so that the evaluation feels relevant to real projects.
- As a solo developer, I want at least one UI or visual task eventually included in the starter suite, so that visual/UX agent behavior is represented.
- As a future evaluator, I want the architecture to support interactive tasks later, so that tasks can allow follow-up questions when that behavior is part of the evaluation.
- As a future evaluator, I want follow-up-question quality to be gradable later, so that interactive agent behavior can be measured without redesigning the whole task schema.
- As a maintainer, I want GitHub Issues to hold PRDs and implementation tickets, so that project work is visible and ready for agents.
- As a maintainer, I want ready-for-agent issues to be well specified, so that an AFK coding agent can pick them up safely.
- As a maintainer, I want architectural decisions recorded as ADRs, so that later contributors understand why the harness behaves the way it does.
- As a maintainer, I want project-private career context kept out of public docs, so that public artifacts remain about the product and evaluation evidence.
- As a maintainer, I want all public docs to use the ubiquitous language consistently, so that agents and humans interpret the project the same way.
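The summary metrics named in the user stories above (pass rate, pass@k, and pass^k) can be illustrated with a short sketch. It assumes n fair trials of one task, of which c passed, and uses the common combinatorial estimators: pass@k as the chance that at least one of k trials drawn from the n passes, and pass^k as the chance that all k drawn trials pass. The function names are hypothetical, and the estimator the lab actually uses is not specified here.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_rate(n: int, c: int) -> float:
    """Fraction of the n fair trials that passed."""
    _check(n, c, 1)
    return c / n


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate of P(at least one of k trials passes): 'eventually succeeds'."""
    _check(n, c, k)
    # comb(n - c, k) counts k-subsets made only of failures (0 if too few failures).
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of P(all k trials pass): 'succeeds consistently'."""
    _check(n, c, k)
    # comb(c, k) counts k-subsets made only of passes (0 if too few passes).
    return comb(c, k) / comb(n, k)
```

For example, with 5 fair trials and 3 passes, the pass rate is 0.6, pass@2 is 0.9, and pass^2 is 0.3. The gap between the last two numbers is what separates an agent harness that can solve a task from one that solves it reliably.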
Implementation Decisions
- The project will use "agent harness" as the comparison target term and will not use "agent tool" or "agent application" interchangeably with it. The underlying model is a separate dimension.
- The first serious evaluation effort is the Codex deep baseline, not an immediate multi-harness comparison. Codex CLI is tested deeply first because it is installed and available, while other harnesses such as Claude Code or Cursor can be integrated later.
- The first suite is the Solo Dev Starter Suite: realistic, small maintenance tasks relevant to solo developers and technical evaluators.
- The suite should be mixed-language over time, but the initial concrete tasks are Python/Click regressions plus the existing 2048 capability task.
- Each task is represented as a task bundle with a source task definition, generated task card, and optional reference artifacts.
- Task definitions are the source of truth; task cards and suite indexes are generated review artifacts.
- Reference artifacts can be patches or commits, but publishable tasks should have verified reference artifacts rather than prose-only reference notes.
- Reference verification uses the same report/result shape as agent trials and is marked as reference verification.
- The deterministic grader outcome remains separate from the human review outcome.
- Code-based graders are the default for coding correctness. Transcript/tool-call graders should be added only when they measure behavior that outcome graders cannot capture.
- The runner applies task-local environment configuration to setup, baseline, agent, and target grader commands.
- Task-local environments currently support workspace-relative PATH prepends and environment variables with a workspace placeholder.
- Python tasks can create a local virtual environment during setup and expose its tools to both agent and graders.
- For Click tasks, the environment must expose the source checkout explicitly to avoid import failures caused by editable-install path handling in paths containing spaces.
- New task, grader, environment, or agent-harness behavior must be tested with one trial and one job before repeated or parallel trials.
- Once a single-trial smoke test passes, repeated trials can use bounded concurrency through the jobs option.
- Successful multi-trial batches should print compact aggregate summaries. Failed batches should identify failed trials and report/result paths.
- Trial directories should be unique even for concurrent trials started in the same second.
- Codex CLI should be discovered through PATH or configured explicitly. The lab should not guess machine-specific install paths.
- Agent launch errors should surface in stderr through shared summary/error behavior, not only in transcripts.
- Environment failures should be fixed in environment/task configuration, not worked around in child adapters.
- Reusable behavior belongs in shared modules; child adapters should remain adapter-specific.
- Human review labels include resource_inefficient for disproportionate runtime, token budget, cost, or command churn. An expensive trial is not automatically a bad one; the label applies only when the cost is out of proportion to the task.
- Token and cost tracking remain open work until the Codex event stream or another source exposes reliable metrics.
- Reports must use evidence-based language: "under these conditions, this harness performed this way", not global model capability claims.
- The first capability reports should be hand-authored interpretations in Markdown, even if drafted with AI assistance.
- Reports should be readable by humans and by AI assistants helping a solo developer decide what tool fits their project.
- GitHub Issues are the project issue tracker for PRDs and implementation work.
- Architectural decisions should continue to live in ADRs; design docs summarize the current system shape.
- Private career-planning context does not belong in public project docs.
- The current task candidate backlog includes promoted Click tasks plus future candidates from Prettier, Vite, HTTPX, Express, and Remotion.
- The current invalid default-map trial batch should be treated as dependency_issue evidence, not as fair Codex capability evidence.
- The system needs an explicit invalid/excluded trial state so summaries can exclude unfair setup/dependency failures rather than relying only on human review labels.
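The task-local environment decisions above (workspace-relative PATH prepends, environment variables with a workspace placeholder, and one environment shared by setup, baseline, agent, and grader commands) could look roughly like the sketch below. It is illustrative only: `build_task_env`, its parameters, and the `{workspace}` placeholder spelling are assumptions, not the lab's actual schema.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional

WORKSPACE_PLACEHOLDER = "{workspace}"  # hypothetical placeholder spelling


def build_task_env(
    base_env: Mapping[str, str],
    workspace: str,
    path_prepend: Iterable[str] = (),
    variables: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build one environment dict to reuse for every command of a trial."""
    root = Path(workspace).resolve()
    env = dict(base_env)

    prepended = []
    for relative in path_prepend:
        candidate = (root / relative).resolve()
        # Reject entries that escape the workspace (absolute paths, "..").
        if not candidate.is_relative_to(root):  # requires Python 3.9+
            raise ValueError(f"PATH entry outside workspace: {relative}")
        prepended.append(str(candidate))

    existing = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(prepended + ([existing] if existing else []))

    for name, value in (variables or {}).items():
        env[name] = value.replace(WORKSPACE_PLACEHOLDER, str(root))
    return env
```

A runner following this sketch would call `build_task_env` once per trial and pass the result as `env=` to each `subprocess.run` call, so the agent harness and the graders resolve the same `pytest`. For a Click-style task, one hypothetical way to expose the source checkout would be a variable such as `PYTHONPATH` set to `{workspace}/src`; the real tasks may use a different mechanism.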
Testing Decisions
- Good tests should exercise external behavior: CLI output, result/report artifacts, task validation, grader command outcomes, reference verification outcomes, and summary rows.
- Tests should avoid overfitting to implementation details such as private helper internals unless the helper exists specifically to enforce an externally important invariant.
- Task loading tests should cover task bundle discovery, required fields, eval type validation, failure-mode validation, reference artifact validation, task-local environment fields, and rejection of paths outside the workspace.
- Environment tests should cover PATH prepending, workspace placeholder expansion, and application of the same environment to setup, baseline, agent, and target commands.
- Runner tests should cover isolated workspace preparation, manual no-pause trials, max files changed enforcement, task environment usage by graders, and collision-resistant trial IDs.
- Codex adapter tests should cover command construction, PATH-based executable resolution, missing executable errors, capture of events/final messages, task environment propagation, and progress behavior.
- CLI tests should cover jobs parsing, single-trial-first behavior where enforceable, quiet passing summaries, failed-trial detail output, and parallel progress behavior.
- Reference verification tests should cover patch and commit artifacts, artifact failure handling, report/result writing, max files changed notes, and task-local environment usage.
- Summary tests should cover pass rate, pass@k, pass^k, grouping dimensions, human review label counts, and, eventually, exclusion of invalid/dependency trials.
- Task-card generator tests or checks should continue to ensure generated task cards and suite indexes match source metadata.
- Pre-commit checks should continue to run task-card drift checks and task validation.
- New task promotion should include reference verification before agent trials.
- New task or environment changes should run one Codex trial/job before repeated trials.
- Repeated trials should only be interpreted after confirming the single-trial path is fair.
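Two of the behaviors these tests target, PATH-based executable resolution with a clear missing-executable error and collision-resistant trial IDs, are small enough to sketch directly. The helper names and the ID format below are hypothetical; the sketch only shows one way to satisfy the stated invariants using the standard library.

```python
import os
import shutil
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def resolve_executable(
    name: str,
    configured: Optional[str] = None,
    search_path: Optional[str] = None,
) -> str:
    """Resolve an agent executable from explicit config, else from PATH."""
    if configured is not None:
        candidate = Path(configured)
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
        raise FileNotFoundError(f"configured executable is not runnable: {configured}")
    # No guessing of machine-specific install locations: PATH only.
    found = shutil.which(name, path=search_path)
    if found is None:
        raise FileNotFoundError(f"{name!r} not found on PATH; configure it explicitly")
    return found


def new_trial_id(task_id: str, now: Optional[datetime] = None) -> str:
    """Build a trial ID that stays unique within the same second."""
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    # The random suffix separates concurrent trials that share a timestamp.
    return f"{task_id}-{stamp}-{uuid.uuid4().hex[:12]}"
```

A test for the second helper can freeze `now` and generate many IDs, asserting that none repeat. A test for the first can point `search_path` at an empty temporary directory and assert that `FileNotFoundError` is raised.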
Out of Scope
- Comparing multiple agent harnesses in the immediate next slice. Codex deep baseline comes first.
- Claiming universal model capability from local Codex CLI results.
- Treating private career/recruiter goals as public project documentation.
- Building a full web dashboard before Markdown capability reports are useful.
- Adding model-based graders before deterministic graders and human review have enough calibration.
- Implementing full token/cost accounting until reliable data is available from agent traces or event streams.
- Making interactive clarification tasks part of the first non-interactive starter suite.
- Depending on fixture repositories for the main credibility of the suite, except where a capability is difficult to capture naturally.
- Running large parallel trial batches for new tasks before a one-trial smoke test passes.
Further Notes
- Current implemented capabilities include task validation, task bundles, generated task cards, reference verification, manual adapter trials, Codex CLI adapter trials, deterministic graders, diff/report/result artifacts, human review labels, repeated-trial execution, concurrent jobs, pass@k/pass^k summaries, and task-local environments.
- Two Click regression tasks have passed one fair Codex trial each after the task environment was corrected.
- An earlier five-trial default-map batch failed because the task environment omitted source import configuration; those trials should be considered dependency_issue runs and excluded from capability interpretation.
- The next high-leverage implementation is explicit invalid/excluded trial handling so summaries can separate fair capability trials from harness or environment failures.
- After invalid/excluded trial handling, the next product milestone is a small repeated fair Codex batch across the verified tasks, followed by a hand-authored evidence-scoped capability report.
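To make the invalid/excluded trial handling described above concrete, the sketch below keeps excluded trials visible in the summary row while leaving them out of the pass-rate denominator. The `TrialResult` fields and the `excluded_reason` name are assumptions chosen to mirror the grouping dimensions in this document (suite, eval type, task, agent harness, model); the real result schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TrialResult:
    """Hypothetical subset of the result metadata needed for summaries."""

    suite: str
    eval_type: str
    task: str
    agent_harness: str
    model: Optional[str]
    passed: bool
    excluded_reason: Optional[str] = None  # e.g. "dependency_issue"


def summarize(results: Iterable[TrialResult]) -> dict:
    """Group like with like and compute pass rate over fair trials only."""
    groups = defaultdict(lambda: {"fair": 0, "passed": 0, "excluded": 0})
    for result in results:
        key = (result.suite, result.eval_type, result.task,
               result.agent_harness, result.model)
        row = groups[key]
        if result.excluded_reason is not None:
            # Counted for transparency, never for capability claims.
            row["excluded"] += 1
            continue
        row["fair"] += 1
        row["passed"] += int(result.passed)

    summary = {}
    for key, row in groups.items():
        rate = row["passed"] / row["fair"] if row["fair"] else None
        summary[key] = {**row, "pass_rate": rate}
    return summary
```

With this shape, a batch in which every trial is excluded reports a pass rate of `None` rather than 0.0, so a setup failure cannot be misread as an agent failure.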