feat(benchmarks): add Claude Code terminal vs NBI chat performance suite #350
Open
pjdoland wants to merge 2 commits into
Standalone benchmark under `benchmarks/claude_perf/` that compares Claude Code response times between the terminal CLI (`claude -p`) and NBI's chat sidebar (WebSocket to Agent SDK). Measures TTFT, wall time, output tokens/chars across four prompts at configurable iteration count.

Key design choices per two rounds of six-agent review:
- Interleaved runs (terminal-NBI-terminal-NBI) to cancel API load drift
- Warm-only medians in the comparison table; cold tagged separately
- Shared `summarize()` in `stats.py` with corrected p95 formula
- `output_chars` captured on both paths for apples-to-apples comparison
- `sys.path.insert` for import robustness regardless of cwd
- try/except around each run so transient failures don't crash the suite
- `.gitignore` entries for results.json, __pycache__, screenshots
- README documents methodology caveats (process model asymmetry, different system prompts, prompt cache confound, API latency dominance)

No test for the benchmark scripts: they are developer tools, not production code, and the entry point is `python bench_runner.py`.
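The PR does not show the body of `summarize()` or say which p95 formula it settled on. As a rough illustration only, here is a minimal sketch that assumes a nearest-rank percentile; the helper name `percentile_nearest_rank` and the returned keys are hypothetical, not taken from `stats.py`.

```python
import math
import statistics
from typing import Dict, Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. (Assumed definition, not confirmed by the PR.)"""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # 1-based rank, clamped so tiny samples still return a value.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one metric (e.g. TTFT in ms) on one path."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "p95": percentile_nearest_rank(values, 95),
        "max": max(values),
    }
```

A common mistake this kind of fix addresses is indexing with `int(0.95 * n)` on a zero-based list, which is off by one for many sample sizes; nearest-rank avoids that by working with a 1-based rank.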
mbektas
approved these changes
May 27, 2026
Collaborator
mbektas
left a comment
thanks! great to see these results!
Summary
Adds a standalone benchmark suite under `benchmarks/claude_perf/` that compares Claude Code response times between the terminal CLI (`claude -p`) and NBI's chat sidebar (WebSocket to Agent SDK). Useful for quantifying the overhead (or lack thereof) NBI's middleware adds to Claude Code requests.
Solution
Six files in `benchmarks/claude_perf/`:
- `bench_terminal.py`: runs `claude -p "prompt" --output-format stream-json --verbose`, extracts `ttft_ms`, `duration_ms`, `duration_api_ms`, token counts, and cost directly from the CLI's own telemetry.
- `bench_nbi_ws.py`: connects to NBI's WebSocket (`/notebook-intelligence/copilot`), sends a chat-request JSON, measures TTFT (time to first `nbiContent` markdown chunk) and wall time (to `stream-end`).
- `bench_runner.py`: orchestrator. Runs N iterations per prompt through both paths, interleaves by default (terminal-NBI-terminal-NBI to cancel API load drift), separates cold/warm, prints a comparison table with warm-only medians.
- `report.py`: re-reads `results.json` and prints formatted comparison tables with ASCII TTFT histograms.
- `prompts.py`: prompt bank (short / medium / code / long).
- `stats.py`: shared `summarize()` with corrected p95 formula.

Plus a `.gitignore` entry for `results.json` (contains cost data) and `__pycache__/`.

Design went through two rounds of six-agent review (three /simplify + performance engineer + test architecture + security). Key remediation items: extracted shared `summarize()`, fixed p95 formula, added try/except per run, interleaved design, warm-only comparison, `output_chars` on both paths, `sys.path` import robustness, stderr truncation in error results, methodology caveats in README.

Local benchmark results (25 iterations, interleaved, 3s cooldown)
Ran locally on macOS with Claude Code 2.1.150, claude-opus-4-7 model, NBI 5.0.1.
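For readers who want a picture of the run order behind these numbers, the following is a minimal sketch of an interleaved schedule with a per-run try/except and a cooldown. It is not the code in `bench_runner.py`: the function names, the `run_once` callback, the record keys, and the 200-character error truncation are all assumptions made for illustration. It treats the first iteration of each prompt as the cold run, matching the "24 warm runs" out of 25 iterations reported here.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

PATHS = ("terminal", "nbi")  # alternate terminal-NBI-terminal-NBI


def interleaved_schedule(
    prompts: Sequence[str], iterations: int
) -> Iterator[Tuple[str, str, int, bool]]:
    """Yield (prompt, path, iteration, is_cold) with the two paths alternating,
    so slow drift in API load affects both paths roughly equally."""
    for prompt in prompts:
        for i in range(iterations):
            for path in PATHS:
                yield prompt, path, i, i == 0  # iteration 0 is tagged cold


def run_suite(
    prompts: Sequence[str],
    iterations: int,
    run_once: Callable[[str, str], Dict],  # hypothetical: (path, prompt) -> metrics
    cooldown_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Run every scheduled request; a failed run is recorded, not fatal."""
    results: List[Dict] = []
    for prompt, path, i, is_cold in interleaved_schedule(prompts, iterations):
        try:
            record = dict(run_once(path, prompt))
            record["ok"] = True
        except Exception as exc:  # transient failures must not crash the suite
            record = {"ok": False, "error": str(exc)[:200]}  # truncate long errors
        record.update(prompt=prompt, path=path, iteration=i, cold=is_cold)
        results.append(record)
        sleep(cooldown_s)
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; a real run would leave the default in place.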
Warm-run comparison (median of 24 warm runs per prompt)
Cold-start TTFT (first request in each block)
Key findings
- `claude -p` invocation (~2-3s overhead). NBI's persistent Agent SDK client skips all of that.
- `claude -p` per request for steady-state usage.
Methodology caveats (documented in README)
Testing
No automated tests for the benchmark scripts (developer tools, not production code). Verified by running the full 25-iteration suite locally and confirming:
- `results.json` is written and `report.py` reads it back
Risks / follow-ups
- `output_chars` is the only cross-path metric measured identically.
- `cache_creation_tokens` in the report would flag warm runs that unexpectedly miss the prompt cache.
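The cache follow-up could look something like the sketch below. It assumes each run record is a dict with a boolean `cold` flag and an integer `cache_creation_tokens` field; the PR does not show the `results.json` schema, so both key names and the threshold parameter are hypothetical.

```python
from typing import Dict, List


def flag_cache_misses(runs: List[Dict], threshold: int = 0) -> List[Dict]:
    """Return warm runs that wrote more than `threshold` tokens to the prompt
    cache, which suggests they did not reuse an existing cache entry."""
    return [
        run
        for run in runs
        if not run.get("cold", False)
        and run.get("cache_creation_tokens", 0) > threshold
    ]


# Example with made-up records (not real benchmark data):
runs = [
    {"iteration": 0, "cold": True, "cache_creation_tokens": 12000},
    {"iteration": 1, "cold": False, "cache_creation_tokens": 0},
    {"iteration": 2, "cold": False, "cache_creation_tokens": 11800},
]
suspect = flag_cache_misses(runs)
print([r["iteration"] for r in suspect])
```

A non-zero threshold may be needed in practice if every turn writes a small number of new tokens to the cache even on a hit.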