feat(benchmarks): add Claude Code terminal vs NBI chat performance suite by pjdoland · Pull Request #350 · plmbr/notebook-intelligence

pjdoland · 2026-05-26T19:48:43Z

Summary

Adds a standalone benchmark suite under benchmarks/claude_perf/ that compares Claude Code response times between the terminal CLI (claude -p) and NBI's chat sidebar (WebSocket to Agent SDK). Useful for quantifying the overhead (or lack thereof) NBI's middleware adds to Claude Code requests.

Solution

Six files in benchmarks/claude_perf/:

bench_terminal.py: runs claude -p "prompt" --output-format stream-json --verbose, extracts ttft_ms, duration_ms, duration_api_ms, token counts, and cost directly from the CLI's own telemetry.
bench_nbi_ws.py: connects to NBI's WebSocket (/notebook-intelligence/copilot), sends a chat-request JSON, measures TTFT (time to first nbiContent markdown chunk) and wall time (to stream-end).
bench_runner.py: orchestrator. Runs N iterations per prompt through both paths, interleaves by default (terminal-NBI-terminal-NBI to cancel API load drift), separates cold/warm, prints a comparison table with warm-only medians.
report.py: re-reads results.json and prints formatted comparison tables with ASCII TTFT histograms.
prompts.py: prompt bank (short / medium / code / long).
stats.py: shared summarize() with corrected p95 formula.
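The PR text does not show the p95 formula that stats.py settled on. As a minimal sketch, assuming linear interpolation between the two closest ranks (one common percentile definition), a shared summarize() could look like the following. The percentile() helper and the returned keys are illustrative, not taken from the PR.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile by linear interpolation between closest ranks (assumed definition)."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted data.
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(values: Sequence[float]) -> dict:
    """Summary statistics for one metric (for example, warm TTFT in ms)."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "p95": percentile(values, 95),
        "max": max(values),
    }
```

With only 24 warm runs per prompt, p95 sits between the two largest samples, so it is sensitive to a single slow run; the median is the more stable figure for the comparison table.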

The PR also adds .gitignore entries for results.json (which contains cost data) and __pycache__/.
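The per-run metrics named in the file list (TTFT, wall time, output_chars) can be illustrated with a transport-agnostic sketch. It assumes the response arrives as an iterable of text chunks; measure_stream and its clock parameter are hypothetical names, and the real scripts read from the CLI's stream-json output and NBI's WebSocket rather than a plain iterable.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a streamed response: TTFT at the first non-empty chunk, wall time at stream end."""
    start = clock()
    first_chunk_at: Optional[float] = None
    output_chars = 0
    for chunk in chunks:
        if first_chunk_at is None and chunk:
            first_chunk_at = clock()  # first visible content
        output_chars += len(chunk)
    end = clock()  # iterator exhausted, i.e. stream end
    return {
        "ttft_ms": None if first_chunk_at is None else (first_chunk_at - start) * 1000.0,
        "wall_ms": (end - start) * 1000.0,
        "output_chars": output_chars,
    }
```

Counting characters the same way on both paths is what makes output_chars comparable even when token counts are only available from one side.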

The design went through two rounds of six-agent review (three /simplify + performance engineer + test architecture + security). Key remediation items: extracted a shared summarize(), fixed the p95 formula, wrapped each run in try/except, interleaved the two paths, restricted the comparison to warm runs, captured output_chars on both paths, made imports robust with sys.path.insert, truncated stderr in error results, and documented methodology caveats in the README.
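Several of these remediation items (interleaving, per-run try/except, cold/warm tagging, truncated error text) interact in the orchestrator loop. The sketch below shows one way they could fit together; run_interleaved, warm_values, the record keys, and the 200-character truncation limit are all assumptions for illustration, not the PR's actual code.

```python
import time
from typing import Callable, Dict, List


def run_interleaved(
    runners: Dict[str, Callable[[str], dict]],
    prompt: str,
    iterations: int,
    cooldown_s: float = 0.0,
) -> List[dict]:
    """Alternate between paths on every iteration so API load drift hits both equally."""
    results: List[dict] = []
    for i in range(iterations):
        for path, runner in runners.items():
            # The first request on each path for this prompt is tagged cold.
            record = {"path": path, "iteration": i, "cold": i == 0}
            try:
                record.update(runner(prompt))
                record["ok"] = True
            except Exception as exc:
                # A transient failure is recorded and the suite keeps going.
                record["ok"] = False
                record["error"] = str(exc)[:200]  # truncated error text
            results.append(record)
            if cooldown_s > 0:
                time.sleep(cooldown_s)
    return results


def warm_values(results: List[dict], path: str, key: str) -> List[float]:
    """Collect one metric from successful warm runs of a single path."""
    return [
        r[key]
        for r in results
        if r["path"] == path and r["ok"] and not r["cold"]
    ]
```

Passing the runners as a dict such as `{"terminal": ..., "nbi": ...}` keeps the alternation order explicit, and feeding `warm_values(...)` into a median gives the warm-only figure used in the comparison table.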

Local benchmark results (25 iterations, interleaved, 3s cooldown)

Ran locally on macOS with Claude Code 2.1.150, claude-opus-4-7 model, NBI 5.0.1.

Warm-run comparison (median of 24 warm runs per prompt)

| Prompt | Terminal TTFT | NBI TTFT | TTFT Delta | Terminal Wall | NBI Wall | Wall Delta |
| --- | --- | --- | --- | --- | --- | --- |
| short ("What is 2+2?") | 2215ms | 1467ms | -34% | 4990ms | 1772ms | -65% |
| medium (list vs tuple) | 3874ms | 2354ms | -39% | 6521ms | 2532ms | -61% |
| code (LCS function) | 5438ms | 6131ms | +13% | 8121ms | 6577ms | -19% |
| long (merge vs quick sort) | 9132ms | 4541ms | -50% | 11954ms | 5050ms | -58% |
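The delta columns read as NBI relative to the terminal baseline, so negative means NBI was faster. The PR does not state the formula; assuming it is (NBI - terminal) / terminal, the TTFT deltas can be reproduced from the warm medians in the table (pct_delta is an illustrative helper name):

```python
def pct_delta(terminal_ms: float, nbi_ms: float) -> float:
    """Relative change of NBI versus the terminal baseline, in percent (assumed formula)."""
    return (nbi_ms - terminal_ms) / terminal_ms * 100.0


# Warm-run TTFT medians from the table: (terminal, NBI) in milliseconds.
warm_ttft_medians = {
    "short": (2215, 1467),
    "medium": (3874, 2354),
    "code": (5438, 6131),
    "long": (9132, 4541),
}

deltas = {
    name: round(pct_delta(terminal, nbi))
    for name, (terminal, nbi) in warm_ttft_medians.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
# prints:
# short: -34%
# medium: -39%
# code: +13%
# long: -50%
```

The published figures may have been computed from unrounded medians, so a recomputation from the rounded table values can differ by a percentage point.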

Cold-start TTFT (first request in each block)

| Prompt | Terminal Cold | NBI Cold |
| --- | --- | --- |
| short | 2123ms | 3628ms |
| medium | 3904ms | 3127ms |
| code | 5090ms | 8176ms |
| long | 9531ms | 7570ms |

Key findings

NBI is faster on warm requests. The terminal CLI pays Node.js subprocess startup, config/CLAUDE.md loading, and MCP server discovery on every claude -p invocation (~2-3s overhead). NBI's persistent Agent SDK client skips all of that.
Cold start is mixed. NBI's first request can be slower (Agent SDK handshake) or faster (prompt cache sharing from a prior session), depending on cache state.
The "code" prompt is the one exception for TTFT (13% slower in NBI). NBI's response was also 78% longer in chars (1435 vs 804), suggesting Claude gave a more detailed answer through NBI's different system prompt, which may account for the TTFT gap.
Wall time is 19-65% lower in NBI across all prompts. The subprocess lifecycle overhead the terminal pays is the dominant factor.
Whatever latency NBI's middleware adds is not measurable against the startup cost that the persistent Agent SDK client eliminates. For steady-state usage, NBI had lower wall time than spawning claude -p per request on all four prompts, and lower TTFT on three of the four.

Methodology caveats (documented in README)

Different process models: terminal spawns a fresh subprocess per call; NBI uses a persistent server. Not an apples-to-apples comparison of API overhead, but representative of what users actually experience.
Different system prompts: the CLI and NBI load different system prompts, creating separate prompt-cache keys. First run in each path pays cache-creation cost.
API latency dominates: Claude API TTFT spans 800-10000ms depending on prompt complexity and load. The NBI/terminal overhead difference is a fraction of API variance.

Testing

No automated tests for the benchmark scripts (developer tools, not production code). Verified by running the full 25-iteration suite locally and confirming:

Both paths produce valid results for all four prompts
Errors are caught and logged without crashing the suite
Cold/warm tagging is correct
Interleaving alternates correctly
results.json is written and report.py reads it back

Risks / follow-ups

The NBI path currently lacks output token counts, cost, and model info (would require server-side instrumentation to expose Agent SDK usage through the WebSocket protocol). output_chars is the only cross-path metric measured identically.
A paired-difference analysis (Wilcoxon signed-rank test on the interleaved pairs) would give confidence intervals on the NBI overhead; currently the report shows independent medians, which wastes the pairing benefit.
Surfacing cache_creation_tokens in the report would flag warm runs that unexpectedly miss the prompt cache.
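To make the paired-difference follow-up concrete, here is a dependency-free sketch of the Wilcoxon signed-rank test using the normal approximation, without tie or continuity corrections. It returns only the test statistic and a two-sided p-value, not a confidence interval, and the function name is illustrative. A library routine such as scipy.stats.wilcoxon would be the more robust choice in practice.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W, two-sided p) for paired samples via the normal approximation."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired (same length)")
    # Zero differences carry no sign information and are dropped.
    diffs = [b - a for a, b in zip(x, y) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Under the null hypothesis W is approximately normal for moderate n.
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return w, p_two_sided
```

In this setting x and y would be the terminal and NBI measurements from the same interleaved iteration, so each difference cancels whatever API load both runs shared. The approximation is rough for small samples; 24 warm pairs per prompt is near the lower end of where it is usually considered acceptable.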

Standalone benchmark under `benchmarks/claude_perf/` that compares Claude Code response times between the terminal CLI (`claude -p`) and NBI's chat sidebar (WebSocket to Agent SDK). Measures TTFT, wall time, output tokens/chars across four prompts at configurable iteration count.

Key design choices per two rounds of six-agent review:

- Interleaved runs (terminal-NBI-terminal-NBI) to cancel API load drift
- Warm-only medians in the comparison table; cold tagged separately
- Shared `summarize()` in `stats.py` with corrected p95 formula
- `output_chars` captured on both paths for apples-to-apples comparison
- `sys.path.insert` for import robustness regardless of cwd
- try/except around each run so transient failures don't crash the suite
- `.gitignore` entries for results.json, __pycache__, screenshots
- README documents methodology caveats (process model asymmetry, different system prompts, prompt cache confound, API latency dominance)

No test for the benchmark scripts: they are developer tools, not production code, and the entry point is `python bench_runner.py`.

mbektas

thanks! great to see these results!

pjdoland added the enhancement label on May 26, 2026

chore: fix prettier formatting on benchmark README

61beca7

mbektas approved these changes May 27, 2026

pjdoland wants to merge 2 commits into plmbr:main from pjdoland:feat/benchmark-suite
