Snowl is an agent evaluation framework that is being hardened into an industrial-grade evaluation platform.
Its durable execution contract is:
- define `Task`
- define `Agent`
- define `Scorer`
- expand into `Task x AgentVariant x Sample`
- run with `snowl eval path/to/project.yml`
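The expansion step of that contract can be pictured with a small, framework-independent sketch. The names `tasks`, `variants`, and `samples` below are illustrative stand-ins, not Snowl API:

```python
from itertools import product

# Illustrative stand-ins for a project's definitions (not Snowl API).
tasks = ["strongreject"]
variants = ["qwen25_7b", "qwen3_32b"]  # e.g. from agent_matrix.models
samples = range(3)                     # per-task sample indices

# The run plan is the full cross product Task x AgentVariant x Sample.
trials = [
    {"task": t, "variant": v, "sample": s}
    for t, v, s in product(tasks, variants, samples)
]

print(len(trials))  # 1 task x 2 variants x 3 samples = 6 trials
```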
Everything else in the repo exists to make that contract reliable, observable, and scalable:
- benchmark adaptation
- multi-model sweeps
- provider-aware concurrency control
- container/runtime orchestration
- artifact persistence
- CLI + Web operator workflows
Key documentation entry points:
- `START_HERE.md`: fastest repo orientation
- `docs/project_map.md`: codebase and directory map
- `docs/current_state.md`: implementation state, limits, and mismatches
- `docs/architecture/runtime_and_scheduler.md`: current eval/runtime/scheduler behavior
- `docs/development_workflows.md`: task-oriented developer workflows
- `docs/testing_and_validation.md`: focused validation commands and done criteria
- `docs/codex_task_playbook.md`: Codex-specific task triage guidance
- `ARCHITECTURE.md`: current system architecture and runtime direction
- `PLANS.md`: roadmap and execution priorities
- `AGENTS.md`: repo rules for coding agents
- `docs/runtime_scheduling.md`: forward-looking runtime scheduling design notes
- `docs/codex_best_practices.md`: Codex guidance for this repo
Snowl already supports:
- YAML-first project entrypoints via `project.yml`
- project-level multi-model authoring via `agent_matrix.models`
- benchmark adapters for: `strongreject`, `terminalbench`, `osworld`, `toolemu`, `agentsafetybench`
- provider-aware local concurrency control
- container-aware execution for terminal / GUI benchmarks
- live artifacts under `.snowl/runs/`
- operator-focused Next.js Web monitor
- plain foreground CLI progress with a background Web monitor sidecar
- `snowl retry <run_id>` recovery for unfinished and non-success trials in the same run
- in-run deferred auto retry for non-success trials, with attempt-aware recovery history
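One way to picture the in-run deferred auto-retry policy is a per-trial decision made after the main pass. The parameter names mirror the `runtime.recovery` settings shown in the project.yml example later in this document; the helper itself is hypothetical, not Snowl's actual code:

```python
# Hypothetical sketch of attempt-aware deferred retry (not Snowl's actual code).
# Parameter names mirror runtime.recovery: auto_retry_non_success,
# max_auto_retries_per_trial.

def should_auto_retry(trial_status: str, attempts_used: int,
                      auto_retry_non_success: bool = True,
                      max_auto_retries_per_trial: int = 1) -> bool:
    """Decide, after the main pass, whether a trial gets a deferred retry."""
    if trial_status == "success":
        return False
    if not auto_retry_non_success:
        return False
    # attempts_used counts the original attempt plus any retries already spent.
    return attempts_used <= max_auto_retries_per_trial

print(should_auto_retry("error", attempts_used=1))  # True: one retry budgeted
print(should_auto_retry("error", attempts_used=2))  # False: budget exhausted
```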
Deployment target today is still local single-machine evaluation.
Install from the repo root:

```
cd /Users/morinop/coding/snowl_v2
pip install -e .
```

This editable install also builds the bundled Web UI used by the packaged monitor.
Several official examples depend on external benchmark repos checked out under fixed paths:
- `references/terminal-bench`
- `references/OSWorld`
- `references/strongreject`
- `references/ToolEmu`
- `references/Agent-SafetyBench`
Example:

```
cd /Users/morinop/coding/snowl_v2
git clone <TERMINAL_BENCH_GIT_URL> references/terminal-bench
git clone <OSWORLD_GIT_URL> references/OSWorld
git clone <STRONGREJECT_GIT_URL> references/strongreject
git clone <TOOLEMU_GIT_URL> references/ToolEmu
git clone <AGENT_SAFETY_BENCH_GIT_URL> references/Agent-SafetyBench
```

Then run the StrongREJECT official example:

```
snowl eval /Users/morinop/coding/snowl_v2/examples/strongreject-official/project.yml
```

Other official examples:
```
snowl eval /Users/morinop/coding/snowl_v2/examples/terminalbench-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/osworld-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/toolemu-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/agentsafetybench-official/project.yml
```

List available benchmark adapters and run one directly:

```
snowl bench list
snowl bench run terminalbench \
  --project /Users/morinop/coding/snowl_v2/examples/terminalbench-official/project.yml \
  --split test
```

The default CLI behavior is:
- foreground: plain terminal progress/logging for the eval itself
- background: auto-started Web monitor sidecar
- optional: `--cli-ui` enables the legacy live terminal UI
Typical flow:
```
snowl eval /absolute/path/to/my-project/project.yml
```

What happens:
- the terminal prints project/run bootstrap details
- the eval begins in the foreground
- once the run is initialized, Snowl prints a Web URL such as
http://127.0.0.1:8765 - stopping the eval also stops that auto-started monitor sidecar
Useful flags:
- filtering: `--task`, `--agent`, `--variant`
- runtime budgets: `--max-running-trials`, `--max-container-slots`, `--max-builds`, `--max-scoring-tasks`, `--provider-budget provider_id=n`
- recovery: `snowl retry <run_id>`
- monitor: `--no-web-monitor`
- legacy live CLI: `--cli-ui`
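For scripted sweeps, it can help to assemble these flags programmatically. The helper below is hypothetical (not part of Snowl); only the flag names themselves come from the list above:

```python
# Hypothetical convenience helper (not part of Snowl) that assembles a
# `snowl eval` command line from the flags listed above.

def build_eval_argv(project_yml, task=None, max_running_trials=None,
                    provider_budget=None, web_monitor=True):
    argv = ["snowl", "eval", project_yml]
    if task:
        argv += ["--task", task]
    if max_running_trials is not None:
        argv += ["--max-running-trials", str(max_running_trials)]
    if provider_budget:
        argv += ["--provider-budget", provider_budget]
    if not web_monitor:
        argv.append("--no-web-monitor")
    return argv

print(build_eval_argv("project.yml", max_running_trials=4, web_monitor=False))
```

The resulting list can be passed to `subprocess.run` to launch the eval without the Web monitor sidecar.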
Recover a long-running run after fixing the environment:
```
snowl retry run-20260311T033703Z --project /absolute/path/to/my-project/project.yml
```

Manual monitor mode is still available:

```
snowl web monitor --project /absolute/path/to/my-project --host 127.0.0.1 --port 8765
```

Snowl now treats one YAML file as the formal entrypoint for a run.
Recommended layout:
```
my-project/
  project.yml
  task.py
  agent.py
  scorer.py
  tool.py      # optional
```
Example:
```yaml
project:
  name: strongreject-qwen-sweep
  root_dir: .

provider:
  id: siliconflow
  kind: openai_compatible
  base_url: https://api.siliconflow.cn/v1
  api_key: sk-...
  timeout: 30
  max_retries: 2

agent_matrix:
  models:
    - id: qwen25_7b
      model: Qwen/Qwen2.5-7B-Instruct
    - id: qwen3_32b
      model: Qwen/Qwen3-32B

judge:
  model: gpt-4.1-mini

eval:
  benchmark: strongreject
  code:
    base_dir: .
    task_module: ./task.py
    agent_module: ./agent.py
    scorer_module: ./scorer.py
    tool_module: ./tool.py
  split: test
  limit: 50

runtime:
  max_running_trials: 8
  max_container_slots: 0
  max_builds: 2
  max_scoring_tasks: 8
  provider_budgets:
    siliconflow: 8
  recovery:
    auto_retry_non_success: true
    max_auto_retries_per_trial: 1
    retry_timing: deferred
    backoff_ms: 2000
```

Key semantics:
- `project.root_dir`: project root for artifact placement and relative paths
- `eval.code.base_dir`: code loading root for `task.py`, `agent.py`, `scorer.py`, `tool.py`
- `provider`: the project's remote model provider
- `agent_matrix.models`: the tested models that expand into `AgentVariant`s
- `judge.model`: optional model-as-judge model, separate from the tested models
- `runtime.provider_budgets`: provider-level concurrency limits
- `runtime.recovery`: in-run deferred retry policy and retry budget
The directory structure still matters; YAML just makes that structure explicit instead of implicit.
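Before launching a run, a parsed config can be sanity-checked with a few lines of plain Python. The section and key names below come from the example config above; the validator itself is illustrative, not a Snowl API:

```python
# Illustrative validator for a parsed project.yml (not a Snowl API).
# Section and key names follow the example config above.

REQUIRED_TOP_LEVEL = ("project", "provider", "agent_matrix", "eval", "runtime")

def validate_project_config(cfg):
    """Return a list of human-readable problems; empty means plausible."""
    problems = [f"missing section: {k}" for k in REQUIRED_TOP_LEVEL if k not in cfg]
    models = cfg.get("agent_matrix", {}).get("models", [])
    if not models:
        problems.append("agent_matrix.models must list at least one model")
    if "root_dir" not in cfg.get("project", {}):
        problems.append("project.root_dir is required for artifact placement")
    return problems

cfg = {"project": {"root_dir": "."}, "provider": {},
       "agent_matrix": {"models": [{"id": "qwen25_7b"}]},
       "eval": {}, "runtime": {}}
print(validate_project_config(cfg))  # [] -> plausible
```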
The recommended pattern in `agent.py` is:

```python
from pathlib import Path

from snowl.agents import build_model_variants
from snowl.core import agent


def build_agent_for_model(model_entry, provider_config):
    ...


@agent(agent_id="demo_agent")
def agents():
    return build_model_variants(
        base_dir=Path(__file__).parent,
        agent_id="demo_agent",
        factory=build_agent_for_model,
    )
```

This same pattern now powers both QA-style examples and container-heavy examples such as TerminalBench and OSWorld.
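Conceptually, `build_model_variants` turns each `agent_matrix.models` entry into one `AgentVariant` by invoking the factory per model. A stripped-down illustration of that expansion (not Snowl's implementation; the `agent_id:model_id` naming scheme is an assumption):

```python
# Stripped-down illustration of model -> variant expansion (not Snowl's code).

def expand_model_variants(agent_id, model_entries, factory):
    """Build one (variant_id, agent) pair per configured model."""
    variants = []
    for entry in model_entries:
        variant_id = f"{agent_id}:{entry['id']}"  # naming scheme is assumed
        variants.append((variant_id, factory(entry)))
    return variants

models = [{"id": "qwen25_7b", "model": "Qwen/Qwen2.5-7B-Instruct"},
          {"id": "qwen3_32b", "model": "Qwen/Qwen3-32B"}]
variants = expand_model_variants("demo_agent", models,
                                 factory=lambda e: e["model"])
print([v[0] for v in variants])
```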
Snowl is moving from coarse semaphores toward an explicit phase-aware scheduler.
Runtime controls now separate:
- `max_running_trials`: active agent execution
- `max_container_slots`: active container/sandbox capacity
- `max_builds`: expensive build/pull/setup work
- `max_scoring_tasks`: scoring concurrency
- `provider_budgets[provider_id]`: remote provider concurrency
Current runtime behavior already reflects two important architectural decisions:
- provider is the main concurrency boundary for remote model calls
- scoring is no longer forced to occupy the same execution slot as agent execution
That means QA workloads and container-heavy workloads can share one scheduler while consuming different budgets.
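The provider-as-concurrency-boundary idea can be sketched with per-provider asyncio semaphores. This is illustrative only; Snowl's phase-aware scheduler is more involved:

```python
import asyncio

# Illustrative per-provider concurrency budget (not Snowl's actual scheduler).
# Mirrors runtime.provider_budgets: each provider_id gets its own semaphore.

class ProviderBudgets:
    def __init__(self, budgets):
        self._sems = {pid: asyncio.Semaphore(n) for pid, n in budgets.items()}

    async def run(self, provider_id, coro_fn):
        # Waits when the provider's budget is fully consumed.
        async with self._sems[provider_id]:
            return await coro_fn()

async def demo():
    budgets = ProviderBudgets({"siliconflow": 2})
    state = {"current": 0, "peak": 0}

    async def fake_model_call():
        state["current"] += 1
        state["peak"] = max(state["peak"], state["current"])
        await asyncio.sleep(0.01)
        state["current"] -= 1

    await asyncio.gather(
        *[budgets.run("siliconflow", fake_model_call) for _ in range(6)]
    )
    return state["peak"]

print(asyncio.run(demo()))  # 2: concurrency never exceeds the provider budget
```

Scoring tasks and container slots would draw on separate semaphores, which is what lets the two workload shapes share one scheduler.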
Each run writes under:
`<project>/.snowl/runs/<run_id>/`
Important artifacts include:
- `manifest.json`
- `plan.json`
- `summary.json`
- `aggregate.json`
- `profiling.json`
- `trials.jsonl`
- `events.jsonl`
- `run.log`
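Artifacts like `trials.jsonl` are line-delimited JSON, so they are easy to post-process. A sketch that tallies trial records by a status field; the exact field names in Snowl's artifacts are not specified here, so `status` and `trial_id` are assumptions:

```python
import json
from collections import Counter

# Sketch: tally trials.jsonl records by status. The `status` field name is an
# assumption; consult the actual artifact schema before relying on it.

def summarize_trials(jsonl_text):
    records = (json.loads(line) for line in jsonl_text.splitlines() if line.strip())
    return Counter(r.get("status", "unknown") for r in records)

sample = "\n".join([
    '{"trial_id": "t1", "status": "success"}',
    '{"trial_id": "t2", "status": "error"}',
    '{"trial_id": "t3", "status": "success"}',
])
print(summarize_trials(sample))  # Counter({'success': 2, 'error': 1})
```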
Observability surfaces:
- CLI: operator-friendly foreground progress/logging
- Web monitor:
  - `/`: run gallery / operator board
  - `/runs/[runId]`: single-run workspace
  - `/compare`: secondary historical comparison view
Running runs are expected to become visible immediately. Snowl writes bootstrap artifacts early so the Web monitor can show planned trials, visible tasks, models, and progress before the run completes.
See examples/README.md for the convention used by official examples.
Python tests:
pytest -qRuntime-focused:
pytest -q tests/test_eval_autodiscovery.py tests/test_runtime_controls_and_profiling.py tests/test_resource_scheduler.py tests/test_cli_eval.pySynthetic scheduler benchmark:
python scripts/runtime_scheduler_benchmark.pyWeb UI typecheck:
cd webui
npm run -s typecheckPackaged install sanity:
pip install -e .