feat(skill): notte-browser v1.1 — eval-driven improvements#28

Closed
kalil-notte wants to merge 1 commit into main from skill/notte-browser-v1.1

kalil-notte commented Apr 16, 2026

Summary

Improved the notte-browser Claude Code skill based on automated eval runs across 23 CLI-coverage tasks, then deep-verified every command and flag against notte --help.

Results — v1.0 baseline vs v1.1

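| Metric | Claude v1.0 | Claude v1.1 | Δ | Codex v1.0 | Codex v1.1 | Δ |
|---|---|---|---|---|---|---|
| Errors | 86 | 67 | -22% | 27 | 17 | -37% |
| Time | 4161s | 3949s | -5% | 1607s | 1422s | -12% |
| Tool calls | 502 | 489 | -3% | 169 | 93 | -45% |
| Cost | $15.39 | $13.88 | -10% | $3.66 | $2.33 | -36% |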

Claude v1.0 completed 22/23 tasks (one FAIL, on files_crud). Every other run completed 23/23.
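
For anyone re-checking the Δ columns, here is a minimal sketch of how they can be reproduced from the raw numbers. It assumes each delta is the relative change from v1.0 to v1.1, rounded to a whole percent; `pct_delta` is a helper name made up for this illustration and is not part of the eval harness.

```python
def pct_delta(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (v1.0, v1.1) pairs copied from the results table above.
claude = {
    "errors": (86, 67),
    "time_s": (4161, 3949),
    "tool_calls": (502, 489),
    "cost_usd": (15.39, 13.88),
}
codex = {
    "errors": (27, 17),
    "time_s": (1607, 1422),
    "tool_calls": (169, 93),
    "cost_usd": (3.66, 2.33),
}

for name, runs in (("Claude", claude), ("Codex", codex)):
    for metric, (v10, v11) in runs.items():
        print(f"{name} {metric}: {pct_delta(v10, v11):+d}%")
```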

Cross-model comparison on v1.1 skill

| Metric | Opus 4.6 | Opus 4.7 | Δ vs 4.6 | Codex (gpt-5.3) |
|---|---|---|---|---|
| Wall time | 3949s | 2997s | -24% | 1422s |
| Tool calls | 489 | 342 | -30% | 93 |
| Errors | 67 | 45 | -33% | 17 |
| Cost | $14.14 | $32.18 | +128% | $2.33 |

Opus 4.7 gets the most out of the v1.1 skill, at more than twice the cost of 4.6: files_crud drops from 978s/10 errs on 4.6 to 44s/1 err on 4.7 (22x faster). Codex remains the cheapest run (about 6x cheaper than Opus 4.6 and about 14x cheaper than Opus 4.7) and uses far fewer tool calls than either.

Changes

Correctness rules (eval-driven):

  • "No top-level notte scrape" warning
  • "IDs are flags not positional" rule
  • form-fill JSON schema enum (full_name, email, etc.)
  • Function file contract (.py, sync def run, session constructor)
  • Cron examples fixed to 6-field AWS-style
  • Tab re-indexing after close-tab
  • Scroll-fail to keyboard fallback (PageDown/End)
  • Stale current-session trap + SID capture recovery pattern
  • notte files has no delete subcommand - guided graceful exit
  • Flag-name traps (--instructions plural, sessions start no --url)
  • One-agent-per-session 409 note
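
To illustrate the cron rule above, here is a minimal sketch of a shape check that separates a 6-field AWS-style expression from a 5-field Unix one. It assumes "AWS-style" means the EventBridge field order (minutes, hours, day-of-month, month, day-of-week, year) with `?` in one of the two day fields. The exact format the notte CLI accepts is not shown in this PR, and `looks_like_aws_cron` is a hypothetical helper, not something shipped in the skill.

```python
def looks_like_aws_cron(expr: str) -> bool:
    """Shape check only: six fields, with '?' in day-of-month or day-of-week.

    Assumed field order (AWS EventBridge convention):
    minutes hours day-of-month month day-of-week year
    """
    fields = expr.split()
    if len(fields) != 6:
        return False
    day_of_month, day_of_week = fields[2], fields[4]
    # EventBridge does not allow both day fields to carry a value at once.
    return "?" in (day_of_month, day_of_week)


print(looks_like_aws_cron("0 9 * * ? *"))  # six fields, '?' for day-of-week
print(looks_like_aws_cron("0 9 * * *"))    # five-field Unix cron
```

The check is deliberately shallow: it does not validate ranges or names inside each field, only the mistake the evals surfaced (a 5-field expression where 6 fields are expected).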

CLI accuracy fixes (deep verification):

  • --proxies to --proxy (singular, matches CLI)
  • Default timeout 30 to 60
  • files list default is --downloads, not --uploads
  • Added 8 missing sessions start flags (--proxy-country, --profile-id, --vault-id, --profile-persist, --web-bot-auth, --screenshot-type, --chrome-args, --extra-http-headers)
  • Added missing session subcommands: code, offset, viewer
  • Added page eval-js to utilities
  • Added agents start flags: --url, --use-vision, --response-format-json, --session-offset
  • Added profiles section (was completely missing)
  • Fixed sessions replay description + --path flag
  • Added --path/--urls-only to sessions network
  • Added [output-path] to page screenshot
  • Added --var/--vars to functions run
  • Added --path to files download
  • Added --name to profiles create, --name/--page/--page-size to profiles list
  • Fixed --headless default (removed the incorrect "default: true")
  • Added chrome-nightly to --browser-type options
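
The flag fixes above came from comparing SKILL.md against `--help` output. A minimal sketch of that kind of cross-check is below. It is not the script used for this PR: the helper names and both sample strings are hypothetical, and the real `notte --help` layout may differ.

```python
import re

# Long-form flags such as --proxy or --proxy-country.
FLAG_RE = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def extract_flags(text: str) -> set[str]:
    """Collect every long flag mentioned in a block of text."""
    return set(FLAG_RE.findall(text))


def diff_flags(help_text: str, skill_text: str) -> tuple[set[str], set[str]]:
    """Return (flags missing from the skill, flags the CLI does not list)."""
    cli, doc = extract_flags(help_text), extract_flags(skill_text)
    return cli - doc, doc - cli


# Hypothetical inputs, for illustration only.
sample_help = """
  --headless          run without a visible window
  --proxy             route traffic through a proxy
  --proxy-country     proxy exit country
"""
sample_skill = "Start a session with --headless and --proxies."

missing, stale = diff_flags(sample_help, sample_skill)
print("missing from skill:", sorted(missing))
print("not in CLI help:", sorted(stale))
```

On the sample strings this flags `--proxy` and `--proxy-country` as undocumented and `--proxies` as stale, which mirrors the first fix in the list.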

Token optimization:

  • Moved 95-line troubleshooting block to references/troubleshooting.md

Eval methodology

  • 23 coverage tasks, one per CLI command group
  • Python harness: Claude Agent SDK + Codex CLI (codex exec --json, token counts from turn.completed events)
  • WebFetch/WebSearch blocked to isolate skill-only signal
  • Deep CLI verification: ran --help for every notte subcommand and cross-referenced SKILL.md
  • Multiple iterations with trace analysis to identify general failure patterns rather than overfit to individual tasks
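
As a rough illustration of the token accounting mentioned above, the sketch below sums usage counters from `turn.completed` events in a JSONL stream such as the one `codex exec --json` emits. The `usage` field and its key names (`input_tokens`, `output_tokens`) are assumptions made for this example, and `sum_turn_usage` is a hypothetical helper. The harness code itself is not part of this PR.

```python
import json


def sum_turn_usage(jsonl: str) -> dict[str, int]:
    """Add up integer usage counters across all turn.completed events."""
    totals: dict[str, int] = {}
    for line in jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines mixed into the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage")  # assumed field name
        if not isinstance(usage, dict):
            continue
        for key, value in usage.items():
            if isinstance(value, int):
                totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical event stream, for illustration only.
sample = "\n".join([
    '{"type": "turn.started"}',
    '{"type": "turn.completed", "usage": {"input_tokens": 1200, "output_tokens": 300}}',
    "not json",
    '{"type": "turn.completed", "usage": {"input_tokens": 800, "output_tokens": 150}}',
])
print(sum_turn_usage(sample))
```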

Test plan

  • v1.0 baseline on Claude Opus 4.6 (23 tasks)
  • v1.1 on Claude Opus 4.6 (23 tasks)
  • v1.1 on Claude Opus 4.7 (23 tasks)
  • v1.0 baseline on Codex gpt-5.3-codex (23 tasks)
  • v1.1 on Codex gpt-5.3-codex (23 tasks)
  • Regression analysis on worst-performing tasks (all regressions attributed to run-to-run variance; no skill defects found)

Generated with Claude Code

…racy fixes

Tested via automated eval harness across 23 CLI-coverage tasks on Claude
Opus 4.6, Claude Opus 4.7, and Codex gpt-5.3-codex. v1.1 reduces Claude
errors -22% and Codex errors -37% vs v1.0 baseline.

Correctness rules (eval-driven):
- "No top-level notte scrape" warning
- "IDs are flags not positional" rule (--persona-id etc.)
- form-fill JSON schema enum (full_name, email, etc.)
- Function file contract (.py, sync def run, notte session constructor)
- Cron examples fixed to 6-field AWS-style
- Tab re-indexing after close-tab
- Scroll-fail to keyboard fallback (PageDown/End)
- Stale current-session trap + SID capture recovery pattern
- notte files has no delete subcommand - graceful exit guidance
- Flag-name traps (--instructions plural, sessions start no --url)
- One-agent-per-session 409 note

CLI accuracy fixes (deep --help verification):
- --proxies to --proxy (singular, matches CLI)
- Default timeout 30 to 60
- files list default is --downloads, not --uploads
- Added 8 missing sessions start flags (--proxy-country, --profile-id,
  --vault-id, --profile-persist, --web-bot-auth, --screenshot-type,
  --chrome-args, --extra-http-headers)
- Added missing session subcommands: code, offset, viewer
- Added page eval-js to utilities
- Added agents start flags: --url, --use-vision, --response-format-json,
  --session-offset
- Added profiles section (was completely missing)
- Fixed sessions replay description + --path flag
- Added --path/--urls-only to sessions network
- Added [output-path] to page screenshot
- Added --var/--vars to functions run
- Added --path to files download
- Added --name to profiles create, --name/--page/--page-size to profiles list
- Fixed --headless default (removed false default: true)
- Added chrome-nightly to --browser-type options

Token optimization:
- Moved 95-line troubleshooting block to references/troubleshooting.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kalil-notte force-pushed the skill/notte-browser-v1.1 branch from c7be82b to a4ffcdb on April 17, 2026 13:42