Snowl is an agent evaluation framework that is being hardened into an industrial-grade evaluation platform.
Its durable execution contract is:
- define `Task`
- define `Agent`
- define `Scorer`
- expand into `Task x AgentVariant x Sample`
- run with `snowl eval path/to/project.yml`
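The expansion step of that contract can be pictured with a small, framework-independent sketch. The names `tasks`, `variants`, and `samples` below are illustrative stand-ins, not Snowl API:

```python
from itertools import product

# Illustrative stand-ins for a project's definitions (not Snowl API).
tasks = ["strongreject"]
variants = ["qwen25_7b", "qwen3_32b"]  # e.g. from agent_matrix.models
samples = range(3)                     # per-task sample indices

# The run plan is the full cross product Task x AgentVariant x Sample.
trials = [
    {"task": t, "variant": v, "sample": s}
    for t, v, s in product(tasks, variants, samples)
]

print(len(trials))  # 1 task x 2 variants x 3 samples = 6 trials
```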
Everything else in the repo exists to make that contract reliable, observable, and scalable:
- benchmark adaptation
- multi-model sweeps
- provider-aware concurrency control
- container/runtime orchestration
- artifact persistence
- CLI + Web operator workflows
Key documentation entry points:
- `START_HERE.md`: fastest repo orientation
- `docs/project_map.md`: codebase and directory map
- `docs/current_state.md`: implementation state, limits, and mismatches
- `docs/architecture/runtime_and_scheduler.md`: current eval/runtime/scheduler behavior
- `docs/development_workflows.md`: task-oriented developer workflows
- `docs/testing_and_validation.md`: focused validation commands and done criteria
- `docs/codex_task_playbook.md`: Codex-specific task triage guidance
- `ARCHITECTURE.md`: current system architecture and runtime direction
- `PLANS.md`: roadmap and execution priorities
- `AGENTS.md`: repo rules for coding agents
- `docs/runtime_scheduling.md`: forward-looking runtime scheduling design notes
- `docs/codex_best_practices.md`: Codex guidance for this repo
Snowl already supports:
- YAML-first project entrypoints via `project.yml`
- project-level multi-model authoring via `agent_matrix.models`
- benchmark adapters for: `strongreject`, `terminalbench`, `osworld`, `toolemu`, `agentsafetybench`
- provider-aware local concurrency control
- container-aware execution for terminal / GUI benchmarks
- live artifacts under `.snowl/runs/`
- operator-focused Next.js Web monitor
- plain foreground CLI progress with a background Web monitor sidecar
- `snowl retry <run_id>` recovery for unfinished and non-success trials in the same run
- in-run deferred auto retry for non-success trials, with attempt-aware recovery history
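One way to picture the in-run deferred auto-retry policy is a per-trial decision made after the main pass. The parameter names mirror the `runtime.recovery` settings shown in the project.yml example later in this document; the helper itself is hypothetical, not Snowl's actual code:

```python
# Hypothetical sketch of attempt-aware deferred retry (not Snowl's actual code).
# Parameter names mirror runtime.recovery: auto_retry_non_success,
# max_auto_retries_per_trial.

def should_auto_retry(trial_status: str, attempts_used: int,
                      auto_retry_non_success: bool = True,
                      max_auto_retries_per_trial: int = 1) -> bool:
    """Decide, after the main pass, whether a trial gets a deferred retry."""
    if trial_status == "success":
        return False
    if not auto_retry_non_success:
        return False
    # attempts_used counts the original attempt plus any retries already spent.
    return attempts_used <= max_auto_retries_per_trial

print(should_auto_retry("error", attempts_used=1))  # True: one retry budgeted
print(should_auto_retry("error", attempts_used=2))  # False: budget exhausted
```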
Deployment target today is still local single-machine evaluation.
Install from the repo root:

```
cd /Users/morinop/coding/snowl_v2
pip install -e .
```

This editable install also builds the bundled Web UI used by the packaged monitor.
Several official examples depend on external benchmark repos checked out under fixed paths:
- `references/terminal-bench`
- `references/OSWorld`
- `references/strongreject`
- `references/ToolEmu`
- `references/Agent-SafetyBench`
Example:

```
cd /Users/morinop/coding/snowl_v2
git clone <TERMINAL_BENCH_GIT_URL> references/terminal-bench
git clone <OSWORLD_GIT_URL> references/OSWorld
git clone <STRONGREJECT_GIT_URL> references/strongreject
git clone <TOOLEMU_GIT_URL> references/ToolEmu
git clone <AGENT_SAFETY_BENCH_GIT_URL> references/Agent-SafetyBench
```

Then run the StrongREJECT official example:

```
snowl eval /Users/morinop/coding/snowl_v2/examples/strongreject-official/project.yml
```

Other official examples:
```
snowl eval /Users/morinop/coding/snowl_v2/examples/terminalbench-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/osworld-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/toolemu-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/agentsafetybench-official/project.yml
```

List available benchmark adapters and run one directly:

```
snowl bench list
snowl bench run terminalbench \
  --project /Users/morinop/coding/snowl_v2/examples/terminalbench-official/project.yml \
  --split test
```

The default CLI behavior is:
- foreground: plain terminal progress/logging for the eval itself
- background: auto-started Web monitor sidecar
- optional: `--cli-ui` enables the legacy live terminal UI
Typical flow:
```
snowl eval /absolute/path/to/my-project/project.yml
```

What happens:
- the terminal prints project/run bootstrap details
- the eval begins in the foreground
- once the run is initialized, Snowl prints a Web URL such as
http://127.0.0.1:8765 - stopping the eval also stops that auto-started monitor sidecar
Useful flags:
- filtering: `--task`, `--agent`, `--variant`
- runtime budgets: `--max-running-trials`, `--max-container-slots`, `--max-builds`, `--max-scoring-tasks`, `--provider-budget provider_id=n`
- recovery: `snowl retry <run_id>`
- monitor: `--no-web-monitor`
- legacy live CLI: `--cli-ui`
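For scripted sweeps, it can help to assemble these flags programmatically. The helper below is hypothetical (not part of Snowl); only the flag names themselves come from the list above:

```python
# Hypothetical convenience helper (not part of Snowl) that assembles a
# `snowl eval` command line from the flags listed above.

def build_eval_argv(project_yml, task=None, max_running_trials=None,
                    provider_budget=None, web_monitor=True):
    argv = ["snowl", "eval", project_yml]
    if task:
        argv += ["--task", task]
    if max_running_trials is not None:
        argv += ["--max-running-trials", str(max_running_trials)]
    if provider_budget:
        argv += ["--provider-budget", provider_budget]
    if not web_monitor:
        argv.append("--no-web-monitor")
    return argv

print(build_eval_argv("project.yml", max_running_trials=4, web_monitor=False))
```

The resulting list can be passed to `subprocess.run` to launch the eval without the Web monitor sidecar.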
Recover a long-running run after fixing the environment:
```
snowl retry run-20260311T033703Z --project /absolute/path/to/my-project/project.yml
```

Manual monitor mode is still available:

```
snowl web monitor --project /absolute/path/to/my-project --host 127.0.0.1 --port 8765
```

Snowl now treats one YAML file as the formal entrypoint for a run.
Recommended layout:
```
my-project/
  project.yml
  task.py
  agent.py
  scorer.py
  tool.py      # optional
```
Example:
```yaml
project:
  name: strongreject-qwen-sweep
  root_dir: .

provider:
  id: siliconflow
  kind: openai_compatible
  base_url: https://api.siliconflow.cn/v1
  api_key: sk-...
  timeout: 30
  max_retries: 2

agent_matrix:
  models:
    - id: qwen25_7b
      model: Qwen/Qwen2.5-7B-Instruct
    - id: qwen3_32b
      model: Qwen/Qwen3-32B

judge:
  model: gpt-4.1-mini

eval:
  benchmark: strongreject
  code:
    base_dir: .
    task_module: ./task.py
    agent_module: ./agent.py
    scorer_module: ./scorer.py
    tool_module: ./tool.py
  split: test
  limit: 50

runtime:
  max_running_trials: 8
  max_container_slots: 0
  max_builds: 2
  max_scoring_tasks: 8
  provider_budgets:
    siliconflow: 8
  recovery:
    auto_retry_non_success: true
    max_auto_retries_per_trial: 1
    retry_timing: deferred
    backoff_ms: 2000
```

Key semantics:
- `project.root_dir`: project root for artifact placement and relative paths
- `eval.code.base_dir`: code loading root for `task.py`, `agent.py`, `scorer.py`, `tool.py`
- `provider`: the project's remote model provider
- `agent_matrix.models`: the tested models that expand into `AgentVariant`s
- `judge.model`: optional model-as-judge model, separate from the tested models
- `runtime.provider_budgets`: provider-level concurrency limits
- `runtime.recovery`: in-run deferred retry policy and retry budget
The directory structure still matters; YAML just makes that structure explicit instead of implicit.
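Before launching a run, a parsed config can be sanity-checked with a few lines of plain Python. The section and key names below come from the example config above; the validator itself is illustrative, not a Snowl API:

```python
# Illustrative validator for a parsed project.yml (not a Snowl API).
# Section and key names follow the example config above.

REQUIRED_TOP_LEVEL = ("project", "provider", "agent_matrix", "eval", "runtime")

def validate_project_config(cfg):
    """Return a list of human-readable problems; empty means plausible."""
    problems = [f"missing section: {k}" for k in REQUIRED_TOP_LEVEL if k not in cfg]
    models = cfg.get("agent_matrix", {}).get("models", [])
    if not models:
        problems.append("agent_matrix.models must list at least one model")
    if "root_dir" not in cfg.get("project", {}):
        problems.append("project.root_dir is required for artifact placement")
    return problems

cfg = {"project": {"root_dir": "."}, "provider": {},
       "agent_matrix": {"models": [{"id": "qwen25_7b"}]},
       "eval": {}, "runtime": {}}
print(validate_project_config(cfg))  # [] -> plausible
```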
The recommended pattern in `agent.py` is:

```python
from pathlib import Path

from snowl.agents import build_model_variants
from snowl.core import agent


def build_agent_for_model(model_entry, provider_config):
    ...


@agent(agent_id="demo_agent")
def agents():
    return build_model_variants(
        base_dir=Path(__file__).parent,
        agent_id="demo_agent",
        factory=build_agent_for_model,
    )
```

This same pattern now powers both QA-style examples and container-heavy examples such as TerminalBench and OSWorld.
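Conceptually, `build_model_variants` turns each `agent_matrix.models` entry into one `AgentVariant` by invoking the factory per model. A stripped-down illustration of that expansion (not Snowl's implementation; the `agent_id:model_id` naming scheme is an assumption):

```python
# Stripped-down illustration of model -> variant expansion (not Snowl's code).

def expand_model_variants(agent_id, model_entries, factory):
    """Build one (variant_id, agent) pair per configured model."""
    variants = []
    for entry in model_entries:
        variant_id = f"{agent_id}:{entry['id']}"  # naming scheme is assumed
        variants.append((variant_id, factory(entry)))
    return variants

models = [{"id": "qwen25_7b", "model": "Qwen/Qwen2.5-7B-Instruct"},
          {"id": "qwen3_32b", "model": "Qwen/Qwen3-32B"}]
variants = expand_model_variants("demo_agent", models,
                                 factory=lambda e: e["model"])
print([v[0] for v in variants])
```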
Snowl is moving from coarse semaphores toward an explicit phase-aware scheduler.
Runtime controls now separate:
- `max_running_trials`: active agent execution
- `max_container_slots`: active container/sandbox capacity
- `max_builds`: expensive build/pull/setup work
- `max_scoring_tasks`: scoring concurrency
- `provider_budgets[provider_id]`: remote provider concurrency
Current runtime behavior already reflects two important architectural decisions:
- provider is the main concurrency boundary for remote model calls
- scoring is no longer forced to occupy the same execution slot as agent execution
That means QA workloads and container-heavy workloads can share one scheduler while consuming different budgets.
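The provider-as-concurrency-boundary idea can be sketched with per-provider asyncio semaphores. This is illustrative only; Snowl's phase-aware scheduler is more involved:

```python
import asyncio

# Illustrative per-provider concurrency budget (not Snowl's actual scheduler).
# Mirrors runtime.provider_budgets: each provider_id gets its own semaphore.

class ProviderBudgets:
    def __init__(self, budgets):
        self._sems = {pid: asyncio.Semaphore(n) for pid, n in budgets.items()}

    async def run(self, provider_id, coro_fn):
        # Waits when the provider's budget is fully consumed.
        async with self._sems[provider_id]:
            return await coro_fn()

async def demo():
    budgets = ProviderBudgets({"siliconflow": 2})
    state = {"current": 0, "peak": 0}

    async def fake_model_call():
        state["current"] += 1
        state["peak"] = max(state["peak"], state["current"])
        await asyncio.sleep(0.01)
        state["current"] -= 1

    await asyncio.gather(
        *[budgets.run("siliconflow", fake_model_call) for _ in range(6)]
    )
    return state["peak"]

print(asyncio.run(demo()))  # 2: concurrency never exceeds the provider budget
```

Scoring tasks and container slots would draw on separate semaphores, which is what lets the two workload shapes share one scheduler.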
Each run writes under:
`<project>/.snowl/runs/<run_id>/`
Important artifacts include:
- `manifest.json`
- `plan.json`
- `summary.json`
- `aggregate.json`
- `profiling.json`
- `trials.jsonl`
- `events.jsonl`
- `run.log`
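Artifacts like `trials.jsonl` are line-delimited JSON, so they are easy to post-process. A sketch that tallies trial records by a status field; the exact field names in Snowl's artifacts are not specified here, so `status` and `trial_id` are assumptions:

```python
import json
from collections import Counter

# Sketch: tally trials.jsonl records by status. The `status` field name is an
# assumption; consult the actual artifact schema before relying on it.

def summarize_trials(jsonl_text):
    records = (json.loads(line) for line in jsonl_text.splitlines() if line.strip())
    return Counter(r.get("status", "unknown") for r in records)

sample = "\n".join([
    '{"trial_id": "t1", "status": "success"}',
    '{"trial_id": "t2", "status": "error"}',
    '{"trial_id": "t3", "status": "success"}',
])
print(summarize_trials(sample))  # Counter({'success': 2, 'error': 1})
```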
Observability surfaces:
- CLI: operator-friendly foreground progress/logging
- Web monitor:
  - `/`: run gallery / operator board
  - `/runs/[runId]`: single-run workspace
  - `/compare`: secondary historical comparison view
Running runs are expected to become visible immediately. Snowl writes bootstrap artifacts early so the Web monitor can show planned trials, visible tasks, models, and progress before the run completes.
See examples/README.md for the convention used by official examples.
Python tests:
pytest -qRuntime-focused:
pytest -q tests/test_eval_autodiscovery.py tests/test_runtime_controls_and_profiling.py tests/test_resource_scheduler.py tests/test_cli_eval.pySynthetic scheduler benchmark:
python scripts/runtime_scheduler_benchmark.pyWeb UI typecheck:
cd webui
npm run -s typecheckPackaged install sanity:
pip install -e .