Qitor/snowl

Snowl

English | 简体中文

Snowl is an agent evaluation framework that is being hardened into an industrial-grade evaluation platform.

Its durable execution contract is:

  • define Task
  • define Agent
  • define Scorer
  • expand into Task x AgentVariant x Sample
  • run with snowl eval path/to/project.yml
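The expansion step above is a Cartesian product over tasks, agent variants, and sample indices. A minimal sketch of that idea (illustrative only; the names and types are hypothetical, not Snowl's internals):

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    task: str
    variant: str
    sample: int


def expand(tasks, variants, samples_per_task):
    # Cartesian expansion: every task runs against every agent variant,
    # repeated once per sample index.
    return [
        Trial(task, variant, sample)
        for task, variant, sample in product(tasks, variants, range(samples_per_task))
    ]


# One task, two model variants, two samples -> four trials.
trials = expand(["refusal_qa"], ["qwen25_7b", "qwen3_32b"], 2)
```

Each resulting trial is the unit that the runtime schedules, retries, and scores.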

Everything else in the repo exists to make that contract reliable, observable, and scalable:

  • benchmark adaptation
  • multi-model sweeps
  • provider-aware concurrency control
  • container/runtime orchestration
  • artifact persistence
  • CLI + Web operator workflows

Project Docs

Current Product Shape

Snowl already supports:

  • YAML-first project entrypoints via project.yml
  • project-level multi-model authoring via agent_matrix.models
  • benchmark adapters for:
    • strongreject
    • terminalbench
    • osworld
    • toolemu
    • agentsafetybench
  • provider-aware local concurrency control
  • container-aware execution for terminal / GUI benchmarks
  • live artifacts under .snowl/runs/
  • operator-focused Next.js Web monitor
  • plain foreground CLI progress with background Web monitor sidecar
  • snowl retry <run_id> recovery for unfinished and non-success trials in the same run
  • in-run deferred auto retry for non-success trials, with attempt-aware recovery history

Deployment target today is still local single-machine evaluation.
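The in-run deferred auto retry listed above can be pictured as: run every trial once, then re-run non-success trials in a final pass, bounded by a per-trial retry budget and a backoff. This is a hypothetical sketch of that policy, not Snowl's implementation:

```python
import time


def run_with_deferred_retry(trials, run_trial, max_auto_retries=1, backoff_ms=2000):
    """Run every trial once, then retry non-success trials at the end.

    `run_trial(trial, attempt)` returns a status string; anything other
    than "success" is eligible for retry. Per-trial attempt history is
    kept so recovery is attempt-aware.
    """
    history = {t: [] for t in trials}
    pending = list(trials)
    for attempt in range(1 + max_auto_retries):
        failed = []
        for trial in pending:
            status = run_trial(trial, attempt)
            history[trial].append(status)
            if status != "success":
                failed.append(trial)
        if not failed or attempt == max_auto_retries:
            break
        time.sleep(backoff_ms / 1000)  # back off before the deferred retry pass
        pending = failed
    return history
```

Deferring retries to the end of the run keeps the first pass's throughput intact and gives transient failures (provider hiccups, container flakes) time to clear.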

Install

cd /Users/morinop/coding/snowl_v2
pip install -e .

This editable install also builds the bundled Web UI used by the packaged monitor.

Prepare Reference Repos

Several official examples depend on external benchmark repos checked out under fixed paths:

  • references/terminal-bench
  • references/OSWorld
  • references/strongreject
  • references/ToolEmu
  • references/Agent-SafetyBench

Example:

cd /Users/morinop/coding/snowl_v2
git clone <TERMINAL_BENCH_GIT_URL> references/terminal-bench
git clone <OSWORLD_GIT_URL> references/OSWorld
git clone <STRONGREJECT_GIT_URL> references/strongreject
git clone <TOOLEMU_GIT_URL> references/ToolEmu
git clone <AGENT_SAFETY_BENCH_GIT_URL> references/Agent-SafetyBench

Quick Start

Run an official example

snowl eval /Users/morinop/coding/snowl_v2/examples/strongreject-official/project.yml

Other official examples:

snowl eval /Users/morinop/coding/snowl_v2/examples/terminalbench-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/osworld-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/toolemu-official/project.yml
snowl eval /Users/morinop/coding/snowl_v2/examples/agentsafetybench-official/project.yml

Run through a benchmark adapter

snowl bench list
snowl bench run terminalbench \
  --project /Users/morinop/coding/snowl_v2/examples/terminalbench-official/project.yml \
  --split test

Default Runtime UX

The default CLI behavior is:

  • foreground: plain terminal progress/logging for the eval itself
  • background: auto-started Web monitor sidecar
  • optional: --cli-ui enables the legacy live terminal UI

Typical flow:

snowl eval /absolute/path/to/my-project/project.yml

What happens:

  1. the terminal prints project/run bootstrap details
  2. the eval begins in the foreground
  3. once the run is initialized, Snowl prints a Web URL such as http://127.0.0.1:8765
  4. stopping the eval also stops that auto-started monitor sidecar

Useful flags:

  • filtering: --task, --agent, --variant
  • runtime budgets:
    • --max-running-trials
    • --max-container-slots
    • --max-builds
    • --max-scoring-tasks
    • --provider-budget provider_id=n
  • recovery: snowl retry <run_id>
  • monitor: --no-web-monitor
  • legacy live CLI: --cli-ui

Recover a long-running run after fixing the environment:

snowl retry run-20260311T033703Z --project /absolute/path/to/my-project/project.yml

Manual monitor mode is still available:

snowl web monitor --project /absolute/path/to/my-project --host 127.0.0.1 --port 8765

project.yml Is The Source Of Truth

Snowl now treats one YAML file as the formal entrypoint for a run.

Recommended layout:

my-project/
  project.yml
  task.py
  agent.py
  scorer.py
  tool.py        # optional

Example:

project:
  name: strongreject-qwen-sweep
  root_dir: .

provider:
  id: siliconflow
  kind: openai_compatible
  base_url: https://api.siliconflow.cn/v1
  api_key: sk-...
  timeout: 30
  max_retries: 2

agent_matrix:
  models:
    - id: qwen25_7b
      model: Qwen/Qwen2.5-7B-Instruct
    - id: qwen3_32b
      model: Qwen/Qwen3-32B

judge:
  model: gpt-4.1-mini

eval:
  benchmark: strongreject
  code:
    base_dir: .
    task_module: ./task.py
    agent_module: ./agent.py
    scorer_module: ./scorer.py
    tool_module: ./tool.py
  split: test
  limit: 50

runtime:
  max_running_trials: 8
  max_container_slots: 0
  max_builds: 2
  max_scoring_tasks: 8
  provider_budgets:
    siliconflow: 8
  recovery:
    auto_retry_non_success: true
    max_auto_retries_per_trial: 1
    retry_timing: deferred
    backoff_ms: 2000

Key semantics:

  • project.root_dir: project root for artifact placement and relative paths
  • eval.code.base_dir: code loading root for task.py, agent.py, scorer.py, tool.py
  • provider: the project's remote model provider
  • agent_matrix.models: the tested models that expand into AgentVariants
  • judge.model: optional model-as-judge model, separate from the tested models
  • runtime.provider_budgets: provider-level concurrency limits
  • runtime.recovery: in-run deferred retry policy and retry budget

The directory structure still matters; YAML just makes that structure explicit instead of implicit.
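For illustration, here is how the agent_matrix and provider_budgets sections behave once project.yml has been parsed (e.g. with PyYAML) into a plain dict. The helper functions and the `agent@model` variant-naming scheme are hypothetical, not Snowl's API:

```python
# A dict mirroring the project.yml example above (parsing itself omitted).
config = {
    "provider": {"id": "siliconflow", "kind": "openai_compatible"},
    "agent_matrix": {"models": [
        {"id": "qwen25_7b", "model": "Qwen/Qwen2.5-7B-Instruct"},
        {"id": "qwen3_32b", "model": "Qwen/Qwen3-32B"},
    ]},
    "runtime": {"provider_budgets": {"siliconflow": 8}},
}


def variant_ids(config, agent_id):
    # Each agent_matrix model entry becomes one AgentVariant of the base agent.
    return [f'{agent_id}@{m["id"]}' for m in config["agent_matrix"]["models"]]


def budget_for(config, provider_id, default=4):
    # Provider budgets cap concurrent remote calls per provider id.
    return config["runtime"].get("provider_budgets", {}).get(provider_id, default)
```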

Multi-Model Authoring

The recommended pattern in agent.py is:

from pathlib import Path

from snowl.agents import build_model_variants
from snowl.core import agent


def build_agent_for_model(model_entry, provider_config):
    ...


@agent(agent_id="demo_agent")
def agents():
    return build_model_variants(
        base_dir=Path(__file__).parent,
        agent_id="demo_agent",
        factory=build_agent_for_model,
    )

This same pattern now powers both QA-style examples and container-heavy examples such as TerminalBench and OSWorld.
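A hypothetical body for the `build_agent_for_model` factory stub above, assuming `model_entry` carries the id/model fields from `agent_matrix.models` and `provider_config` carries the provider block; the real Snowl signature and return type may differ:

```python
def build_agent_for_model(model_entry, provider_config):
    # Hypothetical factory body: capture the model name and provider
    # settings in a closure and return a callable agent.
    model_name = model_entry["model"]

    def run(prompt: str) -> dict:
        # A real agent would call the provider's chat endpoint here;
        # this sketch just records what such a call would use.
        return {
            "model": model_name,
            "base_url": provider_config["base_url"],
            "prompt": prompt,
        }

    return run


agent_fn = build_agent_for_model(
    {"id": "qwen25_7b", "model": "Qwen/Qwen2.5-7B-Instruct"},
    {"base_url": "https://api.siliconflow.cn/v1"},
)
```

The key point is that the factory is called once per agent_matrix entry, so each AgentVariant closes over exactly one tested model.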

Runtime Scheduling Model

Snowl is moving from coarse semaphores toward an explicit phase-aware scheduler.

Runtime controls now separate:

  • max_running_trials: active agent execution
  • max_container_slots: active container/sandbox capacity
  • max_builds: expensive build/pull/setup work
  • max_scoring_tasks: scoring concurrency
  • provider_budgets[provider_id]: remote provider concurrency

Current runtime behavior already reflects two important architectural decisions:

  • provider is the main concurrency boundary for remote model calls
  • scoring is no longer forced to occupy the same execution slot as agent execution

That means QA workloads and container-heavy workloads can share one scheduler while consuming different budgets.
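One way to model these separate budgets is a semaphore per dimension, where a trial must hold both a global running-trials slot and its provider slot to execute. This asyncio sketch is illustrative, not Snowl's scheduler:

```python
import asyncio


async def run_trials(trials, budgets, execute):
    # One semaphore per budget dimension. A trial holds both its global
    # running-trials slot and its provider's slot while executing, so the
    # tighter of the two limits governs concurrency.
    sems = {name: asyncio.Semaphore(n) for name, n in budgets.items()}

    async def run_one(trial):
        async with sems["max_running_trials"], sems[trial["provider"]]:
            return await execute(trial)

    return await asyncio.gather(*(run_one(t) for t in trials))
```

Because every budget is its own semaphore, adding a dimension (builds, container slots, scoring) does not change the shape of the scheduler, only how many locks a given phase must hold.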

Artifacts And Observability

Each run writes under:

<project>/.snowl/runs/<run_id>/

Important artifacts include:

  • manifest.json
  • plan.json
  • summary.json
  • aggregate.json
  • profiling.json
  • trials.jsonl
  • events.jsonl
  • run.log
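Because trials.jsonl is one JSON object per line, it is easy to post-process outside Snowl. A small sketch of counting trial statuses; the `status` field name is an assumption about the artifact schema:

```python
import json


def summarize_trials(jsonl_text):
    # Count trial outcomes from trials.jsonl (one JSON object per line).
    # The "status" field name is assumed, not confirmed by the schema.
    counts = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        status = json.loads(line).get("status", "unknown")
        counts[status] = counts.get(status, 0) + 1
    return counts
```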

Observability surfaces:

  • CLI: operator-friendly foreground progress/logging
  • Web monitor:
    • /: run gallery / operator board
    • /runs/[runId]: single-run workspace
    • /compare: secondary historical comparison view

In-progress runs become visible immediately: Snowl writes bootstrap artifacts early so the Web monitor can show planned trials, visible tasks, models, and progress before the run completes.

Examples

See examples/README.md for the convention used by official examples.

Development Checks

Python tests:

pytest -q

Runtime-focused:

pytest -q tests/test_eval_autodiscovery.py tests/test_runtime_controls_and_profiling.py tests/test_resource_scheduler.py tests/test_cli_eval.py

Synthetic scheduler benchmark:

python scripts/runtime_scheduler_benchmark.py

Web UI typecheck:

cd webui
npm run -s typecheck

Packaged install sanity:

pip install -e .

About

A safety evaluation framework for agents.
