feat(skill): notte-browser v1.1 — eval-driven improvements by kalil-notte · Pull Request #28 · nottelabs/notte-cli

kalil-notte · 2026-04-16T15:04:36Z

Summary

Improved the notte-browser Claude Code skill based on automated eval runs across 23 CLI-coverage tasks, then deep-verified every command and flag against notte --help.

Results — v1.0 baseline vs v1.1

Metric	Claude v1.0	Claude v1.1	Δ Claude	Codex v1.0	Codex v1.1	Δ Codex
Errors	86	67	-22%	27	17	-37%
Time	4161s	3949s	-5%	1607s	1422s	-12%
Tool calls	502	489	-3%	169	93	-45%
Cost	$15.39	$13.88	-10%	$3.66	$2.33	-36%

Claude v1.0 completed 22/23 tasks (1 FAIL on files_crud). All other runs completed 23/23.
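
The Δ columns can be reproduced from the raw values. A minimal sketch, assuming each Δ is the relative change from v1.0 to v1.1 rounded to a whole percent (the helper name pct_change and the dictionary keys are illustrative):

```python
def pct_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (v1.0, v1.1) pairs copied from the results table.
claude = {
    "errors": (86, 67),
    "time_s": (4161, 3949),
    "tool_calls": (502, 489),
    "cost_usd": (15.39, 13.88),
}
codex = {
    "errors": (27, 17),
    "time_s": (1607, 1422),
    "tool_calls": (169, 93),
    "cost_usd": (3.66, 2.33),
}

claude_delta = {name: pct_change(*pair) for name, pair in claude.items()}
codex_delta = {name: pct_change(*pair) for name, pair in codex.items()}

print(claude_delta)  # {'errors': -22, 'time_s': -5, 'tool_calls': -3, 'cost_usd': -10}
print(codex_delta)   # {'errors': -37, 'time_s': -12, 'tool_calls': -45, 'cost_usd': -36}
```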

Cross-model comparison on v1.1 skill

Metric	Opus 4.6	Opus 4.7	Δ vs 4.6	Codex (gpt-5.3-codex)
Wall time	3949s	2997s	-24%	1422s
Tool calls	489	342	-30%	93
Errors	67	45	-33%	17
Cost	$14.14	$32.18	+128%	$2.33

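Opus 4.7 gets the most out of the v1.1 skill: files_crud drops from 978s with 10 errors on Opus 4.6 to 44s with 1 error on Opus 4.7 (about 22x faster). Codex remains the cheapest run at $2.33, versus $14.14 for Opus 4.6 and $32.18 for Opus 4.7, and it also has the lowest wall time, tool-call count, and error count.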

Changes

Correctness rules (eval-driven):

"No top-level notte scrape" warning
"IDs are flags not positional" rule
form-fill JSON schema enum (full_name, email, etc.)
Function file contract (.py, sync def run, session constructor)
Cron examples fixed to 6-field AWS-style
Tab re-indexing after close-tab
Scroll-fail to keyboard fallback (PageDown/End)
Stale current-session trap + SID capture recovery pattern
notte files has no delete subcommand - guided graceful exit
Flag-name traps (--instructions plural, sessions start no --url)
One-agent-per-session 409 note
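
The 6-field cron rule above can be checked mechanically. The sketch below assumes "AWS-style" means the EventBridge field order (minutes, hours, day-of-month, month, day-of-week, year). The helper split_cron_fields and the field names are illustrative and not part of the notte CLI.

```python
# Field order assumed from AWS EventBridge cron expressions (an assumption,
# not taken from the notte documentation).
AWS_CRON_FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def split_cron_fields(expr: str) -> dict[str, str]:
    """Split a cron expression and reject anything that is not 6 fields long."""
    parts = expr.split()
    if len(parts) != len(AWS_CRON_FIELDS):
        raise ValueError(
            f"expected {len(AWS_CRON_FIELDS)} fields, got {len(parts)}: {expr!r}"
        )
    return dict(zip(AWS_CRON_FIELDS, parts))


fields = split_cron_fields("0 9 * * ? *")  # 6 fields: accepted

try:
    split_cron_fields("0 9 * * *")  # classic 5-field Unix form: rejected
    five_field_rejected = False
except ValueError:
    five_field_rejected = True
```

A check like this only catches the wrong field count, which is the failure mode the rule targets. It does not validate the values inside each field.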

CLI accuracy fixes (deep verification):

--proxies to --proxy (singular, matches CLI)
Default timeout 30 to 60
files list default is --downloads, not --uploads
Added 8 missing sessions start flags (--proxy-country, --profile-id, --vault-id, --profile-persist, --web-bot-auth, --screenshot-type, --chrome-args, --extra-http-headers)
Added missing session subcommands: code, offset, viewer
Added page eval-js to utilities
Added agents start flags: --url, --use-vision, --response-format-json, --session-offset
Added profiles section (was completely missing)
Fixed sessions replay description + --path flag
Added --path/--urls-only to sessions network
Added [output-path] to page screenshot
Added --var/--vars to functions run
Added --path to files download
Added --name to profiles create, --name/--page/--page-size to profiles list
Fixed --headless default (removed the incorrect "default: true")
Added chrome-nightly to --browser-type options
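
Most of the fixes in this list are flag-name mismatches between the skill text and the CLI help output. A rough sketch of how such a comparison could be automated follows. The two sample strings are hand-written stand-ins, not real notte output or SKILL.md content, and extract_flags and diff_flags are hypothetical helper names.

```python
import re

# Matches long flags such as --proxy or --proxy-country. The lookbehind keeps
# the match from starting in the middle of a word or a longer dash run.
FLAG_RE = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def extract_flags(text: str) -> set[str]:
    """Collect every long flag mentioned in a block of text."""
    return set(FLAG_RE.findall(text))


def diff_flags(help_text: str, skill_text: str) -> dict[str, list[str]]:
    """Report flags present on only one side of the comparison."""
    cli = extract_flags(help_text)
    doc = extract_flags(skill_text)
    return {
        "missing_from_skill": sorted(cli - doc),
        "not_in_cli": sorted(doc - cli),
    }


# Illustrative inputs only (assumed layout, not captured from the CLI).
sample_help = """
  --headless            run without a visible browser window
  --proxy string        proxy to use
  --proxy-country string
"""
sample_skill = "Start with --headless and route traffic with --proxies."

report = diff_flags(sample_help, sample_skill)
print(report)
```

In this sample the report flags --proxies as unknown to the CLI and lists --proxy and --proxy-country as undocumented, which mirrors the kind of fix recorded above.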

Token optimization:

Moved 95-line troubleshooting block to references/troubleshooting.md

Eval methodology

23 coverage tasks, one per CLI command group
Python harness: Claude Agent SDK + Codex CLI (codex exec --json, token counts from turn.completed events)
WebFetch/WebSearch blocked to isolate skill-only signal
Deep CLI verification: ran --help for every notte subcommand and cross-referenced SKILL.md
Multiple iterations with trace analysis to identify general patterns (not overfitting)
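
Token accounting for the Codex runs can be sketched as a pass over the JSON-lines stream. The example below assumes each turn.completed event carries a usage object with integer counters such as input_tokens and output_tokens. That event shape is an assumption, and the sample stream is made up for illustration.

```python
import json
from collections import Counter


def sum_turn_usage(jsonl: str) -> Counter:
    """Add up integer usage counters across all turn.completed events."""
    totals: Counter = Counter()
    for line in jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise in the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        # "usage" and its keys are assumed field names, not a documented schema.
        for key, value in (event.get("usage") or {}).items():
            if isinstance(value, int):
                totals[key] += value
    return totals


# Made-up sample stream for illustration.
sample = "\n".join([
    '{"type": "thread.started"}',
    '{"type": "turn.completed", "usage": {"input_tokens": 1200, "output_tokens": 80}}',
    "not json",
    '{"type": "turn.completed", "usage": {"input_tokens": 300, "output_tokens": 20}}',
])

totals = sum_turn_usage(sample)
print(dict(totals))  # {'input_tokens': 1500, 'output_tokens': 100}
```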

Test plan

v1.0 baseline on Claude Opus 4.6 (23 tasks)
v1.1 on Claude Opus 4.6 (23 tasks)
v1.1 on Claude Opus 4.7 (23 tasks)
v1.0 baseline on Codex gpt-5.3-codex (23 tasks)
v1.1 on Codex gpt-5.3-codex (23 tasks)
Regression analysis on worst-performing tasks (all variance, no skill defects)

Generated with Claude Code

…racy fixes

Tested via automated eval harness across 23 CLI-coverage tasks on Claude Opus 4.6, Claude Opus 4.7, and Codex gpt-5.3-codex. v1.1 reduces Claude errors -22% and Codex errors -37% vs v1.0 baseline.

Correctness rules (eval-driven):
- "No top-level notte scrape" warning
- "IDs are flags not positional" rule (--persona-id etc.)
- form-fill JSON schema enum (full_name, email, etc.)
- Function file contract (.py, sync def run, notte session constructor)
- Cron examples fixed to 6-field AWS-style
- Tab re-indexing after close-tab
- Scroll-fail to keyboard fallback (PageDown/End)
- Stale current-session trap + SID capture recovery pattern
- notte files has no delete subcommand - graceful exit guidance
- Flag-name traps (--instructions plural, sessions start no --url)
- One-agent-per-session 409 note

CLI accuracy fixes (deep --help verification):
- --proxies to --proxy (singular, matches CLI)
- Default timeout 30 to 60
- files list default is --downloads, not --uploads
- Added 8 missing sessions start flags (--proxy-country, --profile-id, --vault-id, --profile-persist, --web-bot-auth, --screenshot-type, --chrome-args, --extra-http-headers)
- Added missing session subcommands: code, offset, viewer
- Added page eval-js to utilities
- Added agents start flags: --url, --use-vision, --response-format-json, --session-offset
- Added profiles section (was completely missing)
- Fixed sessions replay description + --path flag
- Added --path/--urls-only to sessions network
- Added [output-path] to page screenshot
- Added --var/--vars to functions run
- Added --path to files download
- Added --name to profiles create, --name/--page/--page-size to profiles list
- Fixed --headless default (removed false default: true)
- Added chrome-nightly to --browser-type options

Token optimization:
- Moved 95-line troubleshooting block to references/troubleshooting.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kalil-notte force-pushed the skill/notte-browser-v1.1 branch from c7be82b to a4ffcdb · April 17, 2026 13:42

kalil-notte closed this Apr 17, 2026

kalil-notte wants to merge 1 commit into main from skill/notte-browser-v1.1
