
feat(benchmarks): add Claude Code terminal vs NBI chat performance suite #350

Open
pjdoland wants to merge 2 commits into plmbr:main from pjdoland:feat/benchmark-suite

@pjdoland
Collaborator

Summary

Adds a standalone benchmark suite under benchmarks/claude_perf/ that compares Claude Code response times between the terminal CLI (claude -p) and NBI's chat sidebar (WebSocket to Agent SDK). Useful for quantifying the overhead (or lack thereof) NBI's middleware adds to Claude Code requests.

Solution

Six files in benchmarks/claude_perf/:

  • bench_terminal.py: runs claude -p "prompt" --output-format stream-json --verbose, extracts ttft_ms, duration_ms, duration_api_ms, token counts, and cost directly from the CLI's own telemetry.
  • bench_nbi_ws.py: connects to NBI's WebSocket (/notebook-intelligence/copilot), sends a chat-request JSON, measures TTFT (time to first nbiContent markdown chunk) and wall time (to stream-end).
  • bench_runner.py: orchestrator. Runs N iterations per prompt through both paths, interleaves by default (terminal-NBI-terminal-NBI to cancel API load drift), separates cold/warm, prints a comparison table with warm-only medians.
  • report.py: re-reads results.json and prints formatted comparison tables with ASCII TTFT histograms.
  • prompts.py: prompt bank (short / medium / code / long).
  • stats.py: shared summarize() with corrected p95 formula.
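As a rough illustration of the terminal path, the sketch below pulls metrics out of newline-delimited JSON such as `--output-format stream-json` emits. It is not the code in `bench_terminal.py`: the function name `extract_result_metrics` is hypothetical, and it assumes that the metrics sit on a final event whose `type` is `"result"` and that the keys are spelled as in the description above.

```python
import json
from typing import Iterable, Optional

# Keys named in the PR description. Whether each one is present on the
# final event is an assumption; missing keys come back as None.
METRIC_KEYS = ("ttft_ms", "duration_ms", "duration_api_ms")


def extract_result_metrics(lines: Iterable[str]) -> Optional[dict]:
    """Return metrics from the last event of type "result", or None."""
    result = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON noise mixed into the stream
        if isinstance(event, dict) and event.get("type") == "result":
            result = event  # keep the most recent result event
    if result is None:
        return None
    return {key: result.get(key) for key in METRIC_KEYS}
```

Feeding it the captured stdout lines of one run yields a small dict per run; a run that never produced a result event returns `None`, which the caller can record as an error instead of crashing.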

Plus a .gitignore entry for results.json (contains cost data) and __pycache__/.

The design went through two rounds of six-agent review (three /simplify + performance engineer + test architecture + security). Key remediation items: extracted shared summarize(), fixed p95 formula, added try/except per run, interleaved design, warm-only comparison, output_chars on both paths, sys.path import robustness, stderr truncation in error results, methodology caveats in README.
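The interleaving and cold/warm tagging can be sketched as a schedule builder. This is a minimal illustration, not the code in `bench_runner.py`: `build_schedule` and its tuple layout are hypothetical, and it assumes that "cold" means the first iteration of each path for each prompt.

```python
from typing import List, Tuple

PATHS = ("terminal", "nbi")

# (prompt, path, iteration, is_cold)
Run = Tuple[str, str, int, bool]


def build_schedule(prompts: List[str], iterations: int,
                   interleave: bool = True) -> List[Run]:
    """Return the runs in execution order."""
    schedule: List[Run] = []
    for prompt in prompts:
        if interleave:
            # terminal, nbi, terminal, nbi, ... so both paths sample
            # roughly the same API load at each point in time
            order = [(path, i) for i in range(iterations) for path in PATHS]
        else:
            # all terminal runs first, then all NBI runs
            order = [(path, i) for path in PATHS for i in range(iterations)]
        for path, i in order:
            schedule.append((prompt, path, i, i == 0))
    return schedule
```

With 25 iterations this gives one cold and 24 warm runs per path per prompt, which matches the run counts reported below.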

Local benchmark results (25 iterations, interleaved, 3s cooldown)

Ran locally on macOS with Claude Code 2.1.150, claude-opus-4-7 model, NBI 5.0.1.

Warm-run comparison (median of 24 warm runs per prompt)

| Prompt | Terminal TTFT (ms) | NBI TTFT (ms) | TTFT Delta | Terminal Wall (ms) | NBI Wall (ms) | Wall Delta |
|---|---|---|---|---|---|---|
| short ("What is 2+2?") | 2215 | 1467 | -34% | 4990 | 1772 | -65% |
| medium (list vs tuple) | 3874 | 2354 | -39% | 6521 | 2532 | -61% |
| code (LCS function) | 5438 | 6131 | +13% | 8121 | 6577 | -19% |
| long (merge vs quick sort) | 9132 | 4541 | -50% | 11954 | 5050 | -58% |
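The delta columns read as the relative change from the terminal median to the NBI median, so a negative value means NBI was faster. A one-function sketch under that assumption (`pct_delta` is a hypothetical name, not taken from the suite):

```python
def pct_delta(terminal_ms: float, nbi_ms: float) -> float:
    """Relative change of NBI versus terminal, in percent."""
    if terminal_ms <= 0:
        raise ValueError("terminal_ms must be positive")
    return (nbi_ms - terminal_ms) / terminal_ms * 100.0


# short prompt TTFT: 2215 ms in the terminal, 1467 ms in NBI
print(round(pct_delta(2215, 1467)))  # -34
# code prompt TTFT: 5438 ms in the terminal, 6131 ms in NBI
print(round(pct_delta(5438, 6131)))  # 13
```

Recomputing from the rounded medians in the table can land a point away from a reported figure, since the report presumably works from unrounded values.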

Cold-start TTFT (first request in each block)

| Prompt | Terminal Cold (ms) | NBI Cold (ms) |
|---|---|---|
| short | 2123 | 3628 |
| medium | 3904 | 3127 |
| code | 5090 | 8176 |
| long | 9531 | 7570 |

Key findings

  • NBI is faster on warm requests. The terminal CLI pays Node.js subprocess startup, config/CLAUDE.md loading, and MCP server discovery on every claude -p invocation (~2-3s overhead). NBI's persistent Agent SDK client skips all of that.
  • Cold start is mixed. NBI's first request can be slower (Agent SDK handshake) or faster (prompt cache sharing from a prior session), depending on cache state.
  • The "code" prompt is the one exception for TTFT (13% slower in NBI). NBI's response was also 78% longer in chars (1435 vs 804), suggesting Claude gave a more detailed answer under NBI's different system prompt, which may account for the TTFT gap.
  • Wall time is 19-65% lower in NBI across all prompts. The subprocess lifecycle overhead the terminal pays is the dominant factor.
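  • NBI's middleware adds no latency that these measurements can detect: whatever it costs is smaller than the per-request startup overhead the persistent Agent SDK client avoids. For steady-state usage, NBI had lower wall time than spawning claude -p per request on every prompt, and lower TTFT on three of the four.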

Methodology caveats (documented in README)

  • Different process models: terminal spawns a fresh subprocess per call; NBI uses a persistent server. Not an apples-to-apples comparison of API overhead, but representative of what users actually experience.
  • Different system prompts: the CLI and NBI load different system prompts, creating separate prompt-cache keys. First run in each path pays cache-creation cost.
  • API latency dominates: Claude API TTFT spans 800-10000ms depending on prompt complexity and load. The NBI/terminal overhead difference is a fraction of API variance.

Testing

No automated tests for the benchmark scripts (developer tools, not production code). Verified by running the full 25-iteration suite locally and confirming:

  • Both paths produce valid results for all four prompts
  • Errors are caught and logged without crashing the suite
  • Cold/warm tagging is correct
  • Interleaving alternates correctly
  • results.json is written and report.py reads it back

Risks / follow-ups

  • The NBI path currently lacks output token counts, cost, and model info (would require server-side instrumentation to expose Agent SDK usage through the WebSocket protocol). output_chars is the only cross-path metric measured identically.
  • A paired-difference analysis (Wilcoxon signed-rank test on the interleaved pairs) would give confidence intervals on the NBI overhead; currently the report shows independent medians, which wastes the pairing benefit.
  • Surfacing cache_creation_tokens in the report would flag warm runs that unexpectedly miss the prompt cache.
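The paired-difference follow-up could start from something like the sketch below, which computes the median paired difference and the two Wilcoxon signed-rank sums for interleaved pairs. It is a stdlib-only illustration: `paired_summary` is a hypothetical name, it assumes the i-th terminal run is paired with the i-th NBI run, and it stops at the rank sums instead of producing a p-value or confidence interval.

```python
from statistics import median
from typing import Sequence, Tuple


def paired_summary(terminal_ms: Sequence[float],
                   nbi_ms: Sequence[float]) -> Tuple[float, float, float]:
    """Return (median_diff, w_plus, w_minus) for NBI minus terminal."""
    if len(terminal_ms) != len(nbi_ms) or not terminal_ms:
        raise ValueError("need two non-empty samples of equal length")
    diffs = [n - t for t, n in zip(terminal_ms, nbi_ms)]
    # Zero differences carry no sign, so they are dropped before ranking.
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return median(diffs), w_plus, w_minus
```

A large `w_minus` relative to `w_plus` indicates NBI was faster in most pairs. A library routine such as SciPy's `scipy.stats.wilcoxon` can turn the same paired samples into a p-value.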

Standalone benchmark under `benchmarks/claude_perf/` that compares
Claude Code response times between the terminal CLI (`claude -p`) and
NBI's chat sidebar (WebSocket to Agent SDK). Measures TTFT, wall time,
output tokens/chars across four prompts at configurable iteration count.

Key design choices per two rounds of six-agent review:

- Interleaved runs (terminal-NBI-terminal-NBI) to cancel API load drift
- Warm-only medians in the comparison table; cold tagged separately
- Shared `summarize()` in `stats.py` with corrected p95 formula
- `output_chars` captured on both paths for apples-to-apples comparison
- `sys.path.insert` for import robustness regardless of cwd
- try/except around each run so transient failures don't crash the suite
- `.gitignore` entries for results.json, __pycache__, screenshots
- README documents methodology caveats (process model asymmetry,
  different system prompts, prompt cache confound, API latency dominance)
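
For orientation, a shared `summarize()` with a p95 might look like the sketch below. The description does not say which p95 formula the suite settled on, so this uses the nearest-rank definition as an assumption; the helper `percentile_nearest_rank` and the returned keys are hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one metric across runs."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "p95": percentile_nearest_rank(values, 95),
        "max": max(values),
    }
```

With 24 warm runs the nearest-rank p95 is the 23rd smallest value, so a single outlier run does not set it.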

No tests for the benchmark scripts: they are developer tools, not
production code, and the entry point is `python bench_runner.py`.
@pjdoland added the enhancement label (New feature or request) on May 26, 2026
Collaborator

@mbektas mbektas left a comment

thanks! great to see these results!
