
v0.2.0: CLI, Python package, working e2e replay#1

Merged
pyyush merged 19 commits into main from fix/e2e-replay on Apr 2, 2026


pyyush commented Mar 27, 2026

Summary

  • CLI: dbar replay --cost, dbar eval, dbar validate — honest cost model with verified figures
  • Python package: pip install dbar — native browser-use v0.12.5 integration via on_step_end hooks, zero runtime deps, 33 tests
  • E2E replay works: 5 core bugs fixed — capture → replay at 100% on 3 real websites (books.toscrape.com, example.com, quotes.toscrape.com)
  • Integrations rebuilt: browser-use (snapshot-only, honest about CDP limits) + Browserbase (@browserbasehq/sdk v2.6.0, full determinism)
  • All claims verified: cost figures from browser-use benchmark blog, qualified benchmark stats, security hardening (no --api-key CLI flags, masked CDP URLs, PII warnings)

E2E Results

| Site | Steps | Requests | Replay success |
|---|---|---|---|
| books.toscrape.com | 3 | 68 | 100% |
| example.com | 1 | 1 | 100% |
| quotes.toscrape.com | 2 | 12 | 100% |

Core Fixes

  1. DOM hash non-deterministic — hash getOuterHTML instead of DOMSnapshot (layout-independent)
  2. Accessibility API removed — multi-strategy fallback (legacy → CDP → ariaSnapshot)
  3. Response bodies garbled — hydrateTranscript() resolves deduplicated paths to base64
  4. Virtual time blocks navigation — defer virtualizer to first step(), add suspend()
  5. No multi-step replay — added ReplaySession with step-by-step API
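Fix 1 works because serialized outer HTML carries document structure but no computed layout, so hashing it is stable across runs that only differ in rendering. A minimal Python sketch of the idea, assuming the markup has already been fetched (for example via the CDP `DOM.getOuterHTML` method); `dom_hash` is a hypothetical helper, not DBAR's API:

```python
import hashlib


def dom_hash(outer_html: str) -> str:
    """Hypothetical helper: SHA-256 of a page's serialized outer HTML.

    Only markup is hashed, so layout-dependent data (box sizes,
    scroll offsets) cannot change the digest between runs.
    """
    return hashlib.sha256(outer_html.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = dom_hash("<html><body><h1>Books</h1></body></html>")
    second = dom_hash("<html><body><h1>Books</h1></body></html>")
    changed = dom_hash("<html><body><h1>Books 2</h1></body></html>")
    print(first == second)   # identical markup gives an identical digest
    print(first == changed)  # any structural change alters the digest
```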

Test plan

  • 206 TS unit tests passing
  • 33 Python unit tests passing
  • demo/e2e-test.ts — 3 real websites at 100% replay
  • Typecheck clean
  • Build succeeds
  • npm publish as v0.2.0
  • PyPI publish as v0.1.0

🤖 Generated with Claude Code

pyyush and others added 19 commits March 26, 2026 16:31
DBAR can now record and replay browser-use agent sessions. The bridge
connects to browser-use's running Playwright browser via CDP and wraps
the page with DBAR capture. File-based signaling (.dbar-step, .dbar-finish)
coordinates the Python agent process with the Node.js capture process.

Includes capture.ts, replay.ts, example.py, and supporting config.

Test: tsc --noEmit passes with zero errors
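The file-based signaling between the two processes can be sketched in Python. Only the file names `.dbar-step` and `.dbar-finish` come from the commit; the JSON payload, the atomic-rename write, and `poll_once` are assumptions, and in the real bridge the reading side is the Node.js capture process:

```python
import json
import tempfile
import time
from pathlib import Path

# Signal-file names come from the commit message; the JSON payload is an assumption.
STEP_FILE = ".dbar-step"
FINISH_FILE = ".dbar-finish"


def signal_step(workdir: Path, step: int) -> None:
    """Agent side: announce that `step` has ended by rewriting the step file."""
    tmp = workdir / (STEP_FILE + ".tmp")
    tmp.write_text(json.dumps({"step": step, "ts": time.time()}))
    # Rename over the target so the reader never observes a half-written file.
    tmp.replace(workdir / STEP_FILE)


def signal_finish(workdir: Path) -> None:
    """Agent side: tell the capture process the session is over."""
    (workdir / FINISH_FILE).touch()


def poll_once(workdir: Path, last_step: int) -> tuple:
    """Capture side: report an unseen step first, then finish, else idle."""
    step_path = workdir / STEP_FILE
    if step_path.exists():
        step = json.loads(step_path.read_text())["step"]
        if step > last_step:
            return ("step", step)
    if (workdir / FINISH_FILE).exists():
        return ("finish", last_step)
    return ("idle", last_step)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wd = Path(d)
        signal_step(wd, 1)
        print(poll_once(wd, 0))  # ('step', 1)
        signal_finish(wd)
        print(poll_once(wd, 1))  # ('finish', 1)
```

Checking for a new step before the finish marker means a final step written just before `.dbar-finish` is still captured.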

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Browserbase provides cloud browser infrastructure via CDP. This bridge
lets Browserbase users attach DBAR to their cloud sessions to produce
deterministic capsules that can be replayed locally.

Supports two connection modes:
- Browserbase API: session ID + API key resolves CDP URL automatically
- Direct CDP: raw WebSocket URL for manual connection
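Choosing between the two modes can be sketched as a small pure function. This is a hypothetical illustration: it reads the key from `BROWSERBASE_API_KEY` (the env-var-only rule adopted in a later commit), and it deliberately stops before resolving a session ID into a CDP URL, since that call belongs to Browserbase's API:

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Connection:
    mode: str    # "direct" or "browserbase"
    target: str  # CDP WebSocket URL, or a Browserbase session ID to resolve later


def choose_connection(
    cdp_url: Optional[str], session_id: Optional[str], env: Mapping[str, str]
) -> Connection:
    """Hypothetical mode selection: a raw CDP URL wins, else session ID plus API key."""
    if cdp_url:
        if not cdp_url.startswith(("ws://", "wss://")):
            raise ValueError("direct mode expects a ws:// or wss:// CDP URL")
        return Connection("direct", cdp_url)
    if session_id:
        if not env.get("BROWSERBASE_API_KEY"):
            raise ValueError("Browserbase mode needs BROWSERBASE_API_KEY in the environment")
        return Connection("browserbase", session_id)
    raise ValueError("pass either a CDP URL or a Browserbase session ID")
```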

Follows the same architecture as the browser-use integration (file-based
signaling, same capsule format) adapted for Browserbase's REST API.

Test: typecheck passes (npx tsc --noEmit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds `dbar` CLI binary with three commands:
- `dbar replay <capsule> --cost` — replay with cost comparison
- `dbar eval --capsules <dir> --assertions <yaml>` — batch evaluation
- `dbar validate <capsule>` — structural validation
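The command surface listed above can be mirrored with Python's `argparse` to show its shape. The real `dbar` binary is TypeScript; this sketch only parses arguments, and the help strings are paraphrased from the list rather than taken from the CLI:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the three dbar subcommands; parsing only, no behavior."""
    parser = argparse.ArgumentParser(prog="dbar")
    sub = parser.add_subparsers(dest="command", required=True)

    replay = sub.add_parser("replay", help="replay a capsule")
    replay.add_argument("capsule")
    replay.add_argument("--cost", action="store_true", help="print a cost comparison")

    ev = sub.add_parser("eval", help="batch-evaluate capsules against assertions")
    ev.add_argument("--capsules", required=True, help="directory of capsules")
    ev.add_argument("--assertions", required=True, help="YAML assertions file")

    validate = sub.add_parser("validate", help="structurally validate a capsule")
    validate.add_argument("capsule")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["replay", "run.capsule", "--cost"])
    print(args.command, args.capsule, args.cost)  # replay run.capsule True
```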

Includes professional 4K demo page (demo/index.html) with animated
terminal, cost comparison widget, and integration showcase.

204 tests passing (171 existing + 33 new CLI tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Demo now shows real SHA-256 hashes from books.toscrape.com capture
(Chromium 145.0.7632.6), real browser-use cost numbers ($0.19/task
from their published blog), real market stats (84.6K stars, $17M,
$67.5M), and real divergence detection (DOM mismatch in 669ms).

Added "The Divergence" section showing DBAR catching timestamp
changes between runs — demonstrating the core value proposition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Self-contained 90-second auto-playing HTML presentation for
screen recording. 7 scenes: title, problem, capture, replay,
value props, integrations, CTA. All real data from
books.toscrape.com capture session.

Pure CSS animation sequencing, JS only for pause/play/fullscreen.
Optimized for 1920x1080 Loom recording.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Automated Apple-keynote-quality demo that captures a real DBAR
session on books.toscrape.com with:
- Human-like mouse movement (cubic bezier curves)
- Live updating DBAR dashboard side panel
- 7-scene sequence: navigate, browse, view, cart, capsule, replay, CTA
- Real DBAR capture/replay with actual hashes and cost comparison

Run: npx tsx demo/record.ts
Record screen with QuickTime/Loom while it plays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cost claims audit findings:
- OUTPUT_TOKEN_RATIO 0.15 (agents output ~15% of input)
- Replay shows compute cost (~$0.001), not hardcoded $0.00
- "API savings" replaces misleading "100% savings"

Code quality: args.ts dead branch removed, YAML regex fixed,
eval.ts url_contains documented, CLI sets process.exitCode instead of calling process.exit(0).

206 tests passing.
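The audited cost model can be sketched in Python. `OUTPUT_TOKEN_RATIO = 0.15` and the idea of reporting replay compute instead of a hardcoded $0.00 come from the commit; the function names and the per-million-token prices passed in are hypothetical inputs, not published rates:

```python
OUTPUT_TOKEN_RATIO = 0.15  # from the commit: agents emit roughly 15% as many tokens as they read


def estimate_live_api_cost(input_tokens: int, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """LLM API cost of one live run, with output tokens derived from input tokens."""
    output_tokens = input_tokens * OUTPUT_TOKEN_RATIO
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def cost_comparison(live_api_usd: float, replay_compute_usd: float) -> dict:
    """Report replay compute separately instead of claiming a $0.00 replay."""
    return {
        "live_api_usd": live_api_usd,
        "replay_api_usd": 0.0,  # replay makes no LLM calls
        "replay_compute_usd": replay_compute_usd,  # e.g. the ~$0.001 figure above
        "api_savings_usd": live_api_usd,  # "API savings", not "100% savings"
    }


if __name__ == "__main__":
    # Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
    live = estimate_live_api_cost(100_000, 1.00, 4.00)  # about $0.16
    print(cost_comparison(live, 0.001))
```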

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Demo claims (index.html, video.html):
- $0.19/task -> $0.10/task (verified from browser-use benchmark blog)
- $19/100 tasks -> $10/100 tasks
- "$0.00 replay" -> "$0 API cost" with compute footnote
- "84,600+" -> "84K+" (less likely to go stale)
- "97% accuracy" -> "97% on Online-Mind2Web" with WebVoyager note
- "0% can prove it" -> "no hash-verified proof"
- "669ms" -> "~670ms"
- Split "$84.5M combined funding" into separate figures

Integration honesty:
- browser-use README rewritten: honest about parallel-capture
  architecture, CDP Fetch conflict, limitations section
- Browserbase: removed --api-key CLI flag (env-var only)
- Browserbase: CDP URLs masked in log output
- --no-sandbox gated behind DBAR_NO_SANDBOX=1 env var
- PII security notice added to both integration READMEs
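Gating `--no-sandbox` behind an environment variable can be sketched as follows. `launch_args` is a hypothetical helper; requiring the exact value `"1"` is an assumption that keeps values like `"0"` or `"false"` from silently disabling the sandbox:

```python
from typing import List, Mapping, Sequence


def launch_args(base: Sequence[str], env: Mapping[str, str]) -> List[str]:
    """Append --no-sandbox only when DBAR_NO_SANDBOX is exactly "1"."""
    args = list(base)
    if env.get("DBAR_NO_SANDBOX") == "1":
        args.append("--no-sandbox")
    return args


if __name__ == "__main__":
    import os

    print(launch_args([], os.environ))
```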

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
$0.19 -> $0.10, $0.00 -> $0 API, 100% -> ~99%.
Console.log statements are intentional CLI progress output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite addressing review findings:

Architecture: Snapshot-only capture (NOT full determinism). Honest
about CDP Fetch conflict with browser-use's cdp-use client. DBAR
captures DOM/a11y/screenshot snapshots at step boundaries via raw
CDP, without interfering with the agent.

API fixes:
- Browser() constructor, not BrowserConfig (dropped in v0.12.5)
- on_step_end hook for step signaling (not file-only)
- browser-use uses cdp-use, not Playwright (documented)
- Removed replay.ts (snapshot-only mode can't replay)

Pinned: browser-use==0.12.5, cdp-use==1.4.5
11 unit tests for capture utilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite using official Browserbase SDK:

- session.connectUrl for CDP (not hand-rolled REST calls)
- DBAR OWNS the session — full deterministic capture works
  (virtual time + network recording + state snapshots)
- Secrets via env vars only (BROWSERBASE_API_KEY, PROJECT_ID)
- connectUrl masked in all log output (contains API key)
- Extracted helpers.ts for testable pure functions
- 17 unit tests for arg parsing, URL masking, manifest building

Pinned: @browserbasehq/sdk ^2.6.0
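Because connectUrl embeds the API key, masking has to happen before anything is logged. A Python sketch using only `urllib.parse`; `mask_url` is hypothetical, and since the key's parameter name is not stated here, it masks every query value rather than guessing one:

```python
from urllib.parse import urlsplit, urlunsplit


def mask_url(url: str) -> str:
    """Hide credentials in a URL before logging it.

    Every query-parameter value and any user:password part is replaced,
    and the fragment is dropped; scheme, host and path stay readable.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    if "@" in netloc:
        netloc = "***@" + netloc.rsplit("@", 1)[1]
    pairs = [p for p in parts.query.split("&") if p]
    query = "&".join(p.split("=", 1)[0] + "=***" for p in pairs)
    return urlunsplit((parts.scheme, netloc, parts.path, query, ""))


if __name__ == "__main__":
    # Hypothetical URL and parameter names, for illustration only.
    print(mask_url("wss://connect.example.com/?apiKey=secret&sessionId=abc"))
    # wss://connect.example.com/?apiKey=***&sessionId=***
```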

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pip install dbar — three lines to record browser-use agent sessions:

  recorder = DBARRecorder(output_dir="./capsules")
  result = await agent.run(on_step_end=recorder.on_step_end)
  capsule = recorder.finish()

Records DOM hashes, screenshot hashes, actions, and results at
each step. Zero runtime deps. Capsule.diff() for comparing runs.
Sensitive data redaction by default.
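Redaction by default can be pictured as a recursive pass over each recorded step before it is written. The key pattern and the `[REDACTED]` marker below are hypothetical, not the package's actual rules; a key-name heuristic over-redacts (a key like `passenger` also matches), which is the safer failure mode for a default:

```python
import re
from typing import Any

# Hypothetical pattern of key names treated as sensitive.
SENSITIVE_KEY = re.compile(r"pass|secret|token|api[_-]?key|authorization|cookie", re.IGNORECASE)
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy with values under sensitive-looking keys replaced."""
    if isinstance(value, dict):
        return {
            k: REDACTED if isinstance(k, str) and SENSITIVE_KEY.search(k) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


if __name__ == "__main__":
    step = {"action": "type", "params": {"text": "hi", "password": "hunter2"}}
    print(redact(step))
    # {'action': 'type', 'params': {'text': 'hi', 'password': '[REDACTED]'}}
```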

33 Python tests. 206 TS tests unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 bugs found and fixed via e2e testing against 3 live sites:

1. DOM hash non-deterministic — hash getOuterHTML instead of
   DOMSnapshot (layout-independent, structural only)
2. Accessibility API removed — multi-strategy fallback
   (legacy → CDP getFullAXTree → ariaSnapshot)
3. Response bodies garbled — hydrateTranscript() resolves
   deduplicated body paths back to base64 before replay
4. Virtual time blocks navigation — defer virtualizer start
   to first step(), add suspend() for between-step nav
5. No multi-step replay — added ReplaySession with step-by-step
   API (startReplay → step → finish)

Also: added screenshot_mismatch divergence type.

e2e results (demo/e2e-test.ts):
- books.toscrape.com: 3 steps, 68 requests — 100%
- example.com: 1 step, 1 request — 100%
- quotes.toscrape.com: 2 steps, 12 requests — 100%

206 tests passing. Typecheck clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ReplaySession.step() path was fixed but DBAR.replay() still
used "dom_mismatch" for screenshot divergences. Both paths now
correctly use "screenshot_mismatch".
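One way to keep both replay paths from drifting apart again is to route them through a single classifier. A hypothetical Python sketch; checking the DOM hash before the screenshot hash is an assumed precedence, not documented behavior:

```python
from typing import Mapping, Optional


def classify_divergence(expected: Mapping[str, str], actual: Mapping[str, str]) -> Optional[str]:
    """Label a step divergence; None means both hashes matched."""
    if expected["dom_hash"] != actual["dom_hash"]:
        return "dom_mismatch"
    if expected["screenshot_hash"] != actual["screenshot_hash"]:
        return "screenshot_mismatch"
    return None
```

If both `ReplaySession.step()` and `DBAR.replay()` call the same function, a screenshot-only difference can only ever be reported as `screenshot_mismatch`.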

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed:
- demo/index.html (static landing page with hardcoded data)
- demo/video.html (CSS animation presentation)

Kept:
- demo/e2e-test.ts (real capture+replay on 3 live sites)
- demo/record.ts (real Chrome recording with DBAR capture)
- demo/dashboard.html (live status panel for record.ts)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- package.json: 0.1.0 → 0.2.0
- python/pyproject.toml: 0.1.0 → 0.2.0
- python/dbar/_version.py: 0.1.0 → 0.2.0
- release.yml: added pypi job (hatchling build + twine upload)
  using secrets.PYPI_TOKEN

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…side-package publish

This narrows the public release surface before the follow-up changeset work.
The workflow now blocks tracked operator docs/state files, the README banner
uses a hosted asset that renders on npm, and integration packages are marked
non-publishable with an explicit prepublish failure.

Constraint: Keep this as a stacked prep PR on top of fix/e2e-replay without bundling unrelated dirty-worktree edits
Rejected: Commit the cleanup from the active dirty worktree directly | risked scooping unrelated uncommitted changes into the PR
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep integration packages private unless they gain a deliberate published-package contract and files allowlist
Tested: Root npm ci/build/typecheck/test, browser-use npm ci/typecheck/test, root npm pack --dry-run, integration npm publish --dry-run failure paths
Not-tested: Browserbase npm ci on this base branch; package-lock.json is out of sync before this prep change
This locks the browser-use and Browserbase integration surfaces to the
versions actually verified in this branch and makes those lanes visible in the
public README. The browser-use Python extra now reflects its real Python 3.11+
constraint, both private integration packages point at the local repo package,
and the docs state the exact versions users should expect.

Constraint: Land on fix/e2e-replay without pulling unrelated dirty-worktree edits from the active checkout
Rejected: Push the active checkout directly | risked bundling unrelated in-progress code and generated state
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep integration docs aligned with exact tested versions; if you loosen pins later, update lockfiles and wording from pinned to supported in the same change
Tested: Root npm ci/build/typecheck/test, browser-use npm ci/typecheck/test, browserbase npm ci/test, Python 3.12 uv install of ./python[browser-use,dev], python/tests
Not-tested: Python 3.11 runtime specifically on this machine; browser-use extra was validated on Python 3.12 because that interpreter was available
This replaces the tag-triggered publish path with a Changesets-driven release
workflow on main. The workflow now creates or updates a release PR when
changesets are present and only publishes npm plus PyPI after that version PR
has landed. A small Python version sync step keeps python/pyproject.toml in
lockstep with the package version, and the npm publish helper no-ops when the
current version is already on npm so normal main pushes do not republish.
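The version-sync step and the publish no-op can each be sketched in a few lines of Python. Both helpers are hypothetical and are not the repository's scripts; the regex assumes the first top-level `version = "..."` line in pyproject.toml is the `[project]` version, which holds for a typical layout but should be checked:

```python
import json
import re
from typing import Set

# Matches the first `version = "..."` line; assumes it belongs to [project].
VERSION_LINE = re.compile(r'^(version\s*=\s*)"[^"]*"', re.MULTILINE)


def sync_pyproject_version(package_json: str, pyproject_toml: str) -> str:
    """Copy package.json's version into pyproject.toml text."""
    version = json.loads(package_json)["version"]
    new_text, count = VERSION_LINE.subn(
        lambda m: f'{m.group(1)}"{version}"', pyproject_toml, count=1
    )
    if count != 1:
        raise ValueError("no version line found in pyproject.toml")
    return new_text


def should_publish(current: str, published: Set[str]) -> bool:
    """No-op guard: skip publishing when this version is already on the registry."""
    return current not in published
```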

Constraint: Keep release automation aligned with the existing single-package repo shape and the Python package version lockstep
Rejected: Keep tag-triggered publishing and add a main-branch ancestry check | still allowed branch-side release control and skipped the requested release-PR workflow
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Future releases should merge via the Changesets-generated version PR on main; do not push release tags by hand as the primary release path
Tested: Root npm run release:verify, npm run release:publish:npm no-op on already published 0.2.0, browser-use npm ci/typecheck/test, browserbase npm ci/test, Python 3.12 uv install of ./python[dev] and python/tests
Not-tested: End-to-end GitHub changesets/action execution in Actions; behavior is inferred from the official changesets/action contract and local script verification
pyyush merged commit daa7ba8 into main on Apr 2, 2026
2 checks passed
pyyush deleted the fix/e2e-replay branch on April 2, 2026 at 21:20
