[WIP][experimental] add agentic trace replay benchmark infrastructure by cquil11 · Pull Request #993 · SemiAnalysisAI/InferenceX

cquil11 · 2026-04-01T20:25:38Z

Trace replay benchmarking for agentic coding workloads using real Claude Code traces. Includes:

- Trace replay scripts for H200, MI355X, B200 (vLLM-based)
- kv-cache-tester submodule (trace replayer + 522 anonymized traces)
- AIPerf submodule (alternative synthetic benchmarking)
- Pareto frontier plotting and sweep aggregation
- Metrics collector (Prometheus scraper + visualization)
- Workload distribution analysis
- GitHub Actions workflow with per-TP sweep configs
- MI355X runner SCRIPT_SUFFIX support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-01T20:25:51Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first, before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.gen.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        if: ${{ inputs.config_file != '' }}
+        with:
+          token: ${{ secrets.REPO_PAT }}
+          fetch-depth: 1
+          ref: ${{ inputs.ref || github.ref }}
+          sparse-checkout: ${{ inputs.config_file }}
+
+      - id: gen
+        run: |
+          pip install -q pyyaml
+          python3 << 'PYEOF'
+          import json, os, sys
+
+          config_file = "${{ inputs.config_file }}".strip()
+
+          if config_file:
+              import yaml
+              with open(config_file) as f:
+                  full_config = yaml.safe_load(f)
+
+              config_key = "${{ inputs.config_key }}".strip()
+
+              # If config_key specified, use that section; otherwise auto-detect
+              if config_key and config_key in full_config:
+                  config = full_config[config_key]
+              elif config_key:
+                  print(f"ERROR: config_key '{config_key}' not found. Available: {list(full_config.keys())}")
+                  sys.exit(1)
+              elif all(k.startswith("tp") for k in full_config):
+                  # Top-level keys already look like tp entries (tp2, tp4, etc.);
+                  # check this first so a lone "tp8" key is not unwrapped as a section
+                  config = full_config
+              elif len(full_config) == 1:
+                  config = next(iter(full_config.values()))
+              else:
+                  print(f"ERROR: Multiple entries in config, specify --config_key. Available: {list(full_config.keys())}")
+                  sys.exit(1)
+
+              includes = []
+              for key, settings in config.items():
+                  tp = int(key.replace("tp", ""))
+                  users = settings.get("users", [])
+                  offloads = settings.get("offload", ["on", "off"])
+                  ep = settings.get("ep", 0)
+                  for u in users:
+                      for o in offloads:
+                          entry = {"tp": tp, "users": u, "offload": o}
+                          if ep > 0:
+                              entry["ep"] = ep
+                          includes.append(entry)
+          else:
+              tp_values = json.loads('${{ inputs.tp_values }}')
+              user_values = json.loads('${{ inputs.user_values }}')
+              offload_values = json.loads('${{ inputs.offload_values }}')
+              includes = []
+              for tp in tp_values:
+                  for u in user_values:
+                      for o in offload_values:
+                          includes.append({"tp": tp, "users": u, "offload": o})
+
+          matrix = {"include": includes}
+          print(f"Generated {len(includes)} matrix entries")
+          with open(os.environ["GITHUB_OUTPUT"], "a") as f:
+              f.write(f"matrix={json.dumps(matrix)}\n")
+          PYEOF
+
+  # ---------------------------------------------------------------------------
+  # Matrix benchmark jobs — each cell calls the multiturn template
+  # ---------------------------------------------------------------------------
+  sweep:
+    needs: generate-matrix
+    uses: ./.github/workflows/benchmark-multiturn-tmpl.yml
+    name: sweep /
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
+    secrets: inherit
+    with:
+      runner: ${{ inputs.runner }}
+      image: ${{ inputs.image }}
+      model: ${{ inputs.model }}
+      precision: ${{ inputs.precision }}
+      exp-name: "multiturn_tp${{ matrix.tp }}_users${{ matrix.users }}_offload${{ matrix.offload }}"
+      tp: "${{ matrix.tp }}"
+      users: "${{ matrix.users }}"
+      offload-mode: ${{ matrix.offload }}
+      duration: ${{ inputs.duration }}
+      request-rate: ${{ inputs.request_rate }}
+      total-cpu-dram-gb: ${{ inputs.total_cpu_dram_gb }}
+      script-suffix: ${{ inputs.script_suffix }}
+      ep: "${{ matrix.ep || inputs.ep }}"
+      ref: ${{ inputs.ref }}
+
+  # ---------------------------------------------------------------------------
+  # Collect & aggregate results
+  # ---------------------------------------------------------------------------
+  collect:
+    runs-on: ubuntu-latest
+    needs: sweep
+    if: always()
+    name: Collect results
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          token: ${{ secrets.REPO_PAT }}
+          fetch-depth: 1
+          ref: ${{ inputs.ref || github.ref }}
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Install dependencies
+        run: pip install pandas matplotlib numpy
+
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          pattern: 'multiturn_*'
+          path: results/
+
+      - name: Run aggregation
+        run: |
+          python experimental/multiturn/vllm_benchmark/scripts/collect_sweep_results.py results/ aggregated/
+
+      - name: Upload aggregated results
+        uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+        with:
+          name: multiturn_aggregated
+          path: aggregated/
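The matrix expansion done by the inline Python in the generate-matrix step can be pulled out into a small function so it is testable outside of Actions. This is a sketch, not the workflow's code: the function name `expand_tp_config` is hypothetical, and raising on non-`tpN` keys is an addition the workflow does not do. The defaults mirror the diff: `offload` falls back to `["on", "off"]`, and `ep` is only emitted when it is greater than zero.

```python
import json
from typing import Any


def expand_tp_config(config: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Expand {"tp4": {"users": [...], "offload": [...], "ep": N}} into matrix rows."""
    includes: list[dict[str, Any]] = []
    for key, settings in config.items():
        if not key.startswith("tp"):
            raise ValueError(f"expected a tpN key, got {key!r}")
        tp = int(key[len("tp"):])
        users = settings.get("users", [])
        offloads = settings.get("offload", ["on", "off"])  # same default as the workflow
        ep = settings.get("ep", 0)
        for u in users:
            for o in offloads:
                entry: dict[str, Any] = {"tp": tp, "users": u, "offload": o}
                if ep > 0:
                    entry["ep"] = ep  # only expert-parallel configs carry "ep"
                includes.append(entry)
    return includes


if __name__ == "__main__":
    example = {"tp4": {"users": [8, 16], "offload": ["on"]}, "tp8": {"users": [32], "ep": 8}}
    print(json.dumps({"include": expand_tp_config(example)}))
```

Each (users, offload) pair becomes one matrix cell, so the sweep size is the sum over TP entries of `len(users) * len(offload)`.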


Replaced by vLLM's native kv_offload metrics. Removes subprocess/threading imports and ~100 lines of dead code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add VLLMMetricsParser and SGLangMetricsParser with shared MetricsSnapshot. Backend is auto-detected from the metrics prefix (vllm: vs sglang:) on first poll. sglang metrics mapped:

- token_usage / num_used_tokens → kv_cache_usage
- num_running_reqs → num_requests_running
- num_queue_reqs → num_requests_waiting
- cache_hit_rate × prompt_tokens → prefix_cache_hits/queries
- num_retracted_reqs → num_preemptions
- realtime_tokens_total mode=prefill_compute/prefill_cache → token source

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
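A minimal sketch of prefix-based backend detection over a Prometheus text scrape, with a few of the sglang mappings listed above. It is not the PR's VLLMMetricsParser/SGLangMetricsParser: the function names are hypothetical, the vLLM metric names are assumptions to verify against the `/metrics` output of the deployed vLLM version, and the derived fields (prefix-cache hits, token source) are left out because they need extra arithmetic.

```python
# Assumed vLLM metric names; check them against the running server's /metrics.
VLLM_TO_SHARED = {
    "vllm:num_requests_running": "num_requests_running",
    "vllm:num_requests_waiting": "num_requests_waiting",
    "vllm:kv_cache_usage_perc": "kv_cache_usage",
}

# sglang names from the commit message's mapping (direct 1:1 fields only).
SGLANG_TO_SHARED = {
    "sglang:token_usage": "kv_cache_usage",
    "sglang:num_running_reqs": "num_requests_running",
    "sglang:num_queue_reqs": "num_requests_waiting",
    "sglang:num_retracted_reqs": "num_preemptions",
}


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are dropped and samples sharing a name are summed: fine for
    single-engine gauges, lossy for histograms.
    """
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            name, _, rest = line.partition(" ")
        value = float(rest.split()[0])  # ignore an optional trailing timestamp
        samples[name] = samples.get(name, 0.0) + value
    return samples


def detect_backend(samples: dict[str, float]) -> str:
    """Return "vllm" or "sglang" from the first engine-prefixed metric name."""
    for name in samples:
        if name.startswith("vllm:"):
            return "vllm"
        if name.startswith("sglang:"):
            return "sglang"
    raise ValueError("no vllm: or sglang: metrics found")


def to_shared_fields(samples: dict[str, float]) -> dict[str, float]:
    """Rename engine-specific metrics to the shared snapshot field names."""
    mapping = VLLM_TO_SHARED if detect_backend(samples) == "vllm" else SGLANG_TO_SHARED
    return {shared: samples[src] for src, shared in mapping.items() if src in samples}
```

Unprefixed process metrics (for example `python_gc_*`) are skipped during detection, so they do not confuse the backend choice.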

Replays SWE-bench/GAIA/WildClaw traces from sammshen/lmcache-agentic-traces via AIPerf with mooncake_trace format. Downloads and converts traces at runtime. Supports concurrency sweep with offload on/off. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add --fixed-schedule to replay at exact trace timestamps
- Remove --extra-inputs ignore_eos:true (let model stop naturally)
- Remove unused REQUEST_RATE logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…cessing Drops ~18GB per artifact by excluding inputs.json, conversations.jsonl, responses.json, GPU telemetry, raw records, and full aiperf_artifacts/. Only uploads the specific files used by collect_sweep_results.py and plot_pareto.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The profile_export.jsonl with 233K records was ~10GB per artifact. Switch collect_sweep_results.py and plot_pareto.py to read from the pre-computed profile_export_aiperf.csv (~4KB) instead. Remove the JSONL from the artifact upload. Existing client CSV and trace_replay paths are unchanged. Also exclude low-FreeMem H100 nodes (1, 7, 18) to avoid cudaMallocHost/mlock failures during vLLM CPU KV cache allocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vLLM v0.18.0 follows the newer OpenAI API spec where the 'system' message role was renamed to 'developer'. The LMCache traces use 'system', causing 100% of requests to fail with 400 Bad Request. Also drop the 15GB profile_export_aiperf.json from artifact uploads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The LMCache traces include explicit null values for optional fields (tool_calls, tool_call_id, name) on every message. vLLM's strict Pydantic validation rejects these, causing 100% of requests to fail with HTTP 400. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
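A minimal sketch of the cleanup described above: drop the optional fields only when they are explicitly null, before sending the messages to the server. The helper names are hypothetical. Restricting the filter to the three named fields keeps a legitimately null `content` on assistant tool-call messages intact.

```python
from typing import Any

# Optional message fields that arrive as explicit nulls in the LMCache traces.
NULLABLE_FIELDS = ("tool_calls", "tool_call_id", "name")


def drop_null_optionals(message: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a chat message without null-valued optional fields."""
    return {
        key: value
        for key, value in message.items()
        if not (key in NULLABLE_FIELDS and value is None)
    }


def clean_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Apply drop_null_optionals to every message in a conversation."""
    return [drop_null_optionals(m) for m in messages]
```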

VLLM_USE_SIMPLE_KV_OFFLOAD=1 routes to SimpleCPUOffloadConnector which imports cuda.bindings (NVIDIA-only, PR vllm-project/vllm#37160). Remove it from MI355X scripts so native offloading uses the ROCm-safe OffloadingConnector. Also update H200 trace dir to use traces_neon with env-var override to match the other trace replay scripts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Brings in curated v8 trace set, rate limiting metrics (goodput, effective TTFT, SLO tracking), and updated trace data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Nodes define GRES with GPU subtypes (gpu:h100:8, gpu:h200:8), so salloc must request gpu:h100:N / gpu:h200:N instead of generic gpu:N. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
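A small sketch of building the typed GRES request the commit describes, so callers pass the GPU subtype explicitly. The helper name is hypothetical and the rest of the salloc command line is omitted; this only shows the `gpu:<type>:<count>` form.

```python
import shlex


def typed_gres_arg(gpu_type: str, num_gpus: int) -> str:
    """Return a Slurm GRES request naming the GPU subtype, e.g. --gres=gpu:h200:8.

    Per the commit message, these nodes declare GRES with subtypes
    (gpu:h100:8, gpu:h200:8), so the request names the subtype too.
    """
    if not gpu_type or num_gpus <= 0:
        raise ValueError("gpu_type must be non-empty and num_gpus positive")
    return f"--gres=gpu:{gpu_type}:{num_gpus}"


if __name__ == "__main__":
    # Hypothetical allocation command; other salloc options are omitted.
    print(shlex.join(["salloc", "--nodes=1", typed_gres_arg("h200", 8)]))
```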

Plumbs TRACE_DIR through sweep workflow → template → benchmark script. Accepts relative dir name (e.g. 'traces') or absolute path. Defaults to traces_neon when empty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
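The TRACE_DIR rules above (empty means `traces_neon`, absolute paths are used as-is, relative names resolve under a base directory) fit in a few lines. This is a sketch: the function name is hypothetical, and treating the kv-cache-tester checkout as the base directory is an assumption.

```python
from pathlib import Path

DEFAULT_TRACE_DIR = "traces_neon"


def resolve_trace_dir(trace_dir: str, base_dir: Path) -> Path:
    """Empty -> default dir under base_dir; absolute -> as-is; relative -> under base_dir."""
    name = trace_dir.strip() or DEFAULT_TRACE_DIR
    path = Path(name)
    return path if path.is_absolute() else base_dir / path
```

Usage: `resolve_trace_dir("traces", Path("/opt/kv-cache-tester"))` gives `/opt/kv-cache-tester/traces`, where `/opt/kv-cache-tester` is a hypothetical checkout location.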

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Only pulled trace data files (curated v8 set), no code changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SimpleCPUOffloadConnector uses cuda.bindings (NVIDIA-only). MI355X must use --disable-hybrid-kv-cache-manager with the native OffloadingConnector. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
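A sketch of keeping the vendor split in one place, using only the two settings named in these commits: `VLLM_USE_SIMPLE_KV_OFFLOAD=1` (routes to SimpleCPUOffloadConnector, NVIDIA-only) and `--disable-hybrid-kv-cache-manager` (required on MI355X with the native OffloadingConnector). The names are hypothetical, it assumes the NVIDIA scripts keep the env var, and it deliberately leaves out whatever other vLLM flags these scripts use to turn CPU offloading on.

```python
from typing import NamedTuple


class OffloadLaunch(NamedTuple):
    """Extra environment and CLI arguments for a vLLM launch with KV offload."""
    env: dict[str, str]
    extra_args: list[str]


def offload_launch(vendor: str) -> OffloadLaunch:
    if vendor == "nvidia":
        # SimpleCPUOffloadConnector imports cuda.bindings, so it is NVIDIA-only.
        return OffloadLaunch({"VLLM_USE_SIMPLE_KV_OFFLOAD": "1"}, [])
    if vendor == "amd":
        # ROCm: no simple-offload env var; native OffloadingConnector without HMA.
        return OffloadLaunch({}, ["--disable-hybrid-kv-cache-manager"])
    raise ValueError(f"unknown vendor: {vendor!r}")
```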

…races

Add ISB-1 (Inference Stress Benchmark), a multi-turn, long-context KV cache stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure

- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)

> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> (Cameron Quilici)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Workflows: new hash_block_mode input on multiturn-sweep and benchmark-multiturn-tmpl, threaded into the trace_replay_tester via HASH_BLOCK_MODE env (default false, existing runs unchanged).
- Benchmark scripts (b200/h200/mi355x): a TRACE_DIR prefixed with "hf_" now loads from a Hugging Face dataset, e.g. hf_semianalysisai--cc-traces-0 maps to --hf-dataset semianalysisai/cc-traces-0. Otherwise behaves as before with --trace-directory.
- Bump kv-cache-tester submodule to pick up --hash-block-mode and --hf-dataset support.
- simulate_hash_block_mode.py: dry-run simulator matching the schema of neon_trace_simulation.json. Reports a prefix-cache hit estimate and an infinite-set upper bound; aggregate mode runs across a directory.
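The `hf_` naming rule can be sketched as a mapping from TRACE_DIR to replayer arguments: strip the prefix, turn the first `--` into `/`, and otherwise fall back to `--trace-directory`. The function name is hypothetical, and rejecting an `hf_` value without `--` is an added safeguard rather than documented behavior.

```python
def trace_source_args(trace_dir: str) -> list[str]:
    """Map a TRACE_DIR value to trace replayer CLI arguments."""
    if trace_dir.startswith("hf_"):
        rest = trace_dir[len("hf_"):]
        if "--" not in rest:
            raise ValueError(f"expected hf_<org>--<name>, got {trace_dir!r}")
        repo_id = rest.replace("--", "/", 1)  # org--name -> org/name
        return ["--hf-dataset", repo_id]
    return ["--trace-directory", trace_dir]
```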

+        raw = json.load(f)
+
+    block_size = raw.get("block_size", 64)
+    trace_id = raw.get("id", trace_path.stem)
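To illustrate what an "infinite-set upper bound" on prefix-cache hits means, here is a sketch that chains a hash over each full block of tokens (so a block matches only when its whole prefix matches) and counts blocks already seen earlier. It is not the simulator's logic or schema: each request is assumed to be a flat list of token IDs, and the default block size of 64 only mirrors the `raw.get("block_size", 64)` fallback in the snippet above. No finite cache can beat this ratio, since nothing is ever evicted.

```python
import hashlib
from typing import Iterable, Sequence


def block_hashes(tokens: Sequence[int], block_size: int = 64) -> list[str]:
    """Chained hash per full block; a trailing partial block is ignored."""
    hashes: list[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = f"{prev}|{','.join(map(str, block))}".encode()
        prev = hashlib.sha256(payload).hexdigest()  # depends on the full prefix
        hashes.append(prev)
    return hashes


def infinite_cache_hit_rate(requests: Iterable[Sequence[int]], block_size: int = 64) -> float:
    """Fraction of full blocks already seen in earlier requests (no eviction)."""
    seen: set[str] = set()
    hits = total = 0
    for tokens in requests:
        for h in block_hashes(tokens, block_size):
            total += 1
            if h in seen:
                hits += 1
            else:
                seen.add(h)
    return hits / total if total else 0.0
```

With `block_size=2`, the requests `[1, 2, 3, 4]`, `[1, 2, 3, 4, 5, 6]` and `[9, 2, 3, 4]` yield 7 blocks, of which only the first two blocks of the second request hit: a changed first token invalidates every later block in the chain.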




… clean support vocabulary

README.md:
- Remove dead links to docs removed in 5f6aba7 (COVERAGE_AUDIT, LONG_CONTEXT_TRUTH_MATRIX, SUPPORT_MATRIX, RUNBOOKs, INVESTIGATION)
- Replace stale 50-export-files count with post-flatten per-subtree inventory (23 bundles + 3 manifests = 26 total, consolidating framework-specific variants into flat single files)
- Add explicit five-class support-status vocabulary section
- Keep safe/unsafe claim boundary

COEXISTENCE_WITH_KV_CACHE_TESTER.md:
- Strip planning/negotiation sections (Recommended PR Structure and maintainer-request list), which are not coexistence-technical
- Replace possessive references with PR-number references throughout (kv-cache-tester -> PR SemiAnalysisAI#993, ISB1 -> PR SemiAnalysisAI#1032)
- Update data-directory layout to show flat paths
- Update ISB1 workflow name to run-isb1-kv-stress-sweep.yml
- Add support-status vocabulary section

GMI_EXECUTION_PLAN.md:
- Prepend support-status framing (reviewed_preview, dataset_replay_verified, not live-serving certification)
- Fix stale nested paths to flat: extension_131k/vllm/ -> extension_131k/
- Fix preview bundle names: strip __vllm/__sglang suffixes
- Update final result-pipeline sentence to cite actual analyzer scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add --debug-trace flag to trace replay tester (stores full request/response bodies including reasoning_content to JSONL)
- Plumb debug_trace through GHA workflows (multiturn-sweep.yml, benchmark-multiturn-tmpl.yml) and all 4 benchmark scripts
- Add b200-fp4-dsr1-weka-trace-debug config (tp4, 2 users, offload off)
- Add flamegraph generator script for visualizing per-trace cache hit/miss patterns as icicle charts
- Bump kv-cache-tester submodule

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

+
+import argparse
+import json
+import os


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The metrics collector starts ~2 minutes before the trace replay sends its first request. Strip rows with zero activity and reset relative_time so the CSV starts at first actual usage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
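One reading of the trimming step, sketched on rows already loaded as dicts (for example via `csv.DictReader`): drop the leading idle window, then rebase `relative_time` so the first active row is at 0. The function name and the activity column names used in practice are hypothetical, and this version keeps idle rows that occur after activity starts.

```python
from typing import Any


def trim_leading_idle(rows: list[dict[str, Any]], activity_keys: tuple[str, ...]) -> list[dict[str, Any]]:
    """Drop rows before the first non-zero activity and rebase relative_time to 0."""
    first = next(
        (i for i, row in enumerate(rows) if any(float(row[k]) != 0 for k in activity_keys)),
        None,
    )
    if first is None:
        return []  # the collector never saw any activity
    t0 = float(rows[first]["relative_time"])
    trimmed = []
    for row in rows[first:]:
        new_row = dict(row)  # copy so the caller's rows are not mutated
        new_row["relative_time"] = float(row["relative_time"]) - t0
        trimmed.append(new_row)
    return trimmed
```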

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Plumb no_max_tokens through GHA workflows and all benchmark scripts. When enabled, the trace replayer doesn't enforce max_tokens from the trace, letting models like DeepSeek R1 generate freely so they can close <think> blocks and produce visible output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
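The effect of `no_max_tokens` on each replayed request can be sketched as a request-body builder that omits `max_tokens` when the flag is set, leaving the server-side default in effect. The function name and the minimal body shape are assumptions; the real replayer's request carries more fields.

```python
from typing import Any, Optional


def build_chat_request(
    model: str,
    messages: list[dict[str, Any]],
    trace_max_tokens: Optional[int],
    no_max_tokens: bool = False,
) -> dict[str, Any]:
    """Build an OpenAI-style chat request body for one replayed turn."""
    body: dict[str, Any] = {"model": model, "messages": messages}
    # Only forward the trace's limit when it exists and the flag is off.
    if trace_max_tokens is not None and not no_max_tokens:
        body["max_tokens"] = trace_max_tokens
    return body
```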

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…v-cache-tester (PR SemiAnalysisAI#993)

PR SemiAnalysisAI#1032 keeps only data + the conversion shim; Cam's kv-cache-tester submodule on PR SemiAnalysisAI#993 owns replay.

Delete:
- utils/bench_serving/benchmark_export_replay.py
- utils/process_result_isb1.py
- utils/test_benchmark_export_replay.py
- utils/test_process_result_isb1.py

Revert to upstream/main:
- utils/process_result.py
- utils/test_process_result.py

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cquil11 added the experimental label Apr 1, 2026

github-project-automation Bot added this to InferenceMAX Board Apr 1, 2026

github-advanced-security AI found potential problems Apr 1, 2026


cquil11 and others added 25 commits April 1, 2026 15:27

remove deprecated GpuTransferCollector from metrics collector

28991eb


remove unused Protocol import

6a41d49

add H100 LMCache trace sweep config

ee76767

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

remove --fixed-schedule: use concurrency mode per Samuel's recommendation

fc8e3cf

update yaml

6bbbfa9

fix H100 runner: add SCRIPT_SUFFIX support

a2e4fe6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: mkdir RESULT_DIR before trace conversion

fee0278

add H200 LMCache trace benchmark and config

769532c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update yaml

02876af

fix H200-nb runner: add SCRIPT_SUFFIX support

2134fd8

fix all H200 runners: add SCRIPT_SUFFIX support

ab2812a

fix all runners: add SCRIPT_SUFFIX support

5aa993f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add exclusive

bd4ec30

add exclusive

a12cc9d

add exclusive

af49d11

debug

4f106b8

revert system->developer role conversion in LMCache traces

ede9bde

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix MetricsCollector missing gpu_transfer_collector attribute

a7ac440

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cquil11 and others added 11 commits April 15, 2026 10:45

mi355x dsr1 config

e86b5d5

update kv-cache-tester: merge traces-ratelimiting branch

78d7388


fix CW salloc: specify GPU type in GRES request

56bf004


remove --exclusive from CW salloc, not supported on dynamic nodes

6411d18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update kv-cache-tester: traces only from traces-ratelimiting

b07b4eb


update kv-cache-tester: remove debug logging

00f8118

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix MI355X offloading: use native connector without HMA

6abde55


mi355x dsr1 config

55b53fe

remove second exclusive from b200 dgxc srun

b5e14dc

github-code-quality Bot found potential problems Apr 16, 2026


Comment thread experimental/multiturn/vllm_benchmark/simulate_hash_block_mode.py


cquil11 added 3 commits April 16, 2026 15:39

bump kv-cache-tester: print theoretical cache-hit ceilings at init

b3c4a83

bump kv-cache-tester: add theoretical_cumulative_cache_hit_rate (infinite cache) to assessment periods

fcb80e8

bump kv-cache-tester: fix RequestMetrics dataclass field ordering

4e49761

cquil11 and others added 2 commits April 20, 2026 09:13

Merge branch 'main' into experimental/agentic-benchmark

d66f3b1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-code-quality Bot found potential problems Apr 20, 2026


Comment thread experimental/multiturn/vllm_benchmark/flamegraphs/generate_flamegraphs.py


cquil11 and others added 5 commits April 20, 2026 11:20

fix: include debug_trace.jsonl in artifact upload

fc0be5b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bump kv-cache-tester: add raw_chunks to debug trace

f57f7c8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bump kv-cache-tester: fix no_max_tokens attribute error

816e410

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

OCWC22 mentioned this pull request Apr 22, 2026

Add opt-in KV offload sweep, probe, and operator playbook OCWC22/InferenceX#3

Open

cquil11 wants to merge 80 commits into main from experimental/agentic-benchmark