feat(skill): notte-browser v1.1 — eval-driven improvements#28
Closed
kalil-notte wants to merge 1 commit into main
…racy fixes

Tested via automated eval harness across 23 CLI-coverage tasks on Claude Opus 4.6, Claude Opus 4.7, and Codex gpt-5.3-codex. v1.1 reduces Claude errors by 22% and Codex errors by 37% vs the v1.0 baseline.

Correctness rules (eval-driven):
- "No top-level notte scrape" warning
- "IDs are flags not positional" rule (--persona-id etc.)
- form-fill JSON schema enum (full_name, email, etc.)
- Function file contract (.py, sync def run, notte session constructor)
- Cron examples fixed to 6-field AWS-style
- Tab re-indexing after close-tab
- Scroll-fail to keyboard fallback (PageDown/End)
- Stale current-session trap + SID capture recovery pattern
- notte files has no delete subcommand
- Graceful exit guidance
- Flag-name traps (--instructions plural, sessions start no --url)
- One-agent-per-session 409 note

CLI accuracy fixes (deep --help verification):
- --proxies to --proxy (singular, matches CLI)
- Default timeout 30 to 60
- files list default is --downloads, not --uploads
- Added 8 missing sessions start flags (--proxy-country, --profile-id, --vault-id, --profile-persist, --web-bot-auth, --screenshot-type, --chrome-args, --extra-http-headers)
- Added missing session subcommands: code, offset, viewer
- Added page eval-js to utilities
- Added agents start flags: --url, --use-vision, --response-format-json, --session-offset
- Added profiles section (was completely missing)
- Fixed sessions replay description + --path flag
- Added --path/--urls-only to sessions network
- Added [output-path] to page screenshot
- Added --var/--vars to functions run
- Added --path to files download
- Added --name to profiles create, --name/--page/--page-size to profiles list
- Fixed --headless default (removed the incorrect "default: true")
- Added chrome-nightly to --browser-type options

Token optimization:
- Moved 95-line troubleshooting block to references/troubleshooting.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
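The "6-field AWS-style" cron rule from the commit message can be illustrated with a small checker. This is a minimal sketch, not the notte CLI's own validation: it assumes the format follows the AWS EventBridge layout (minutes, hours, day-of-month, month, day-of-week, year), where day-of-month and day-of-week cannot both be specified. The function name `check_aws_cron` is hypothetical.

```python
def check_aws_cron(expr: str) -> list[str]:
    """Return a list of problems found in an AWS-style cron expression.

    Hypothetical helper: assumes the AWS EventBridge field layout
    (minutes hours day-of-month month day-of-week year).
    """
    fields = expr.split()
    if len(fields) != 6:
        # A classic 5-field Unix crontab entry fails here.
        return [f"expected 6 fields, got {len(fields)}"]

    _minute, _hour, day_of_month, _month, day_of_week, _year = fields
    problems = []
    # AWS rule: if one of the two day fields carries a value or "*",
    # the other must be "?". So at least one of them has to be "?".
    if day_of_month != "?" and day_of_week != "?":
        problems.append("day-of-month and day-of-week cannot both be set; use '?' in one")
    return problems


if __name__ == "__main__":
    for candidate in ("0 9 * * *", "0 9 * * * *", "0 9 * * ? *"):
        print(candidate, "->", check_aws_cron(candidate) or "ok")
```

The check only covers field count and the `?` rule; it does not validate ranges or special characters inside each field.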
c7be82b to a4ffcdb
Summary
Improved the notte-browser Claude Code skill based on automated eval runs across 23 CLI-coverage tasks, then deep-verified every command and flag against notte --help.
Results: v1.0 baseline vs v1.1
Claude v1.0 completed 22/23 tasks (1 FAIL on files_crud). All other cells: 23/23.
Cross-model comparison on v1.1 skill
Opus 4.7 unlocks the v1.1 skill's full value: notably, files_crud drops from 978s/10 errs on 4.6 to 44s/1 err on 4.7 (22x faster). Codex remains the cheapest by ~10x and most efficient by a wide margin.
Changes
Correctness rules (eval-driven):
CLI accuracy fixes (deep verification):
Token optimization:
Eval methodology
Test plan
Generated with Claude Code