Add opt-in KV offload sweep, probe, and operator playbook#3
OCWC22 wants to merge 1 commit into isb1/kv-cache-stress-benchmark from
Conversation
Stacks on 38fd91a (PR SemiAnalysisAI#1032), mirrors the opt-in fork framing from b31f7c1 (fork PR #2), and remains a sibling of 992ff21 (GMI runbook).
> Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
Pull request overview
This PR extends the benchmarking/eval surface to support separate multi-node eval jobs and adds operator-facing documentation/configuration and additional benchmark recipes (plus new/updated ISB1 dataset artifacts via Git LFS).
Changes:
- Split eval results into single-node vs multi-node (`evals` vs `multinode_evals`) and extend validation to accept `eval-conc` for multi-node entries.
- Add a multi-node eval-only execution path for the AMD multi-node runner scripts and wire a dedicated `sweep-multi-node-evals` job into `run-sweep.yml`.
- Add multiple new single-node benchmark scripts, new runner script(s), docs, and new/updated ISB1 dataset manifests/pointers plus `.gitattributes` rules for LFS.
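As a rough sketch of the validation change, the multi-node `eval-conc` acceptance might look like the helper below. The field names and schema here are assumptions for illustration, not the repo's actual `utils/matrix_logic/validation.py`:

```python
def validate_multinode_eval(entry):
    """Validate one multinode_evals changelog entry (hypothetical schema).

    Multi-node entries are identified by a `prefill` key and may now
    carry an optional `eval-conc` (eval concurrency) field.
    """
    if "prefill" not in entry:
        raise ValueError("multi-node eval entries must specify `prefill`")
    if "eval-conc" in entry and not isinstance(entry["eval-conc"], int):
        raise ValueError("`eval-conc` must be an integer concurrency value")
    return entry
```

The key design point is that `eval-conc` is optional: entries without it still validate, so existing changelog rows are unaffected.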
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| utils/process_changelog.py | Splits eval results into evals and multinode_evals based on presence of prefill. |
| utils/matrix_logic/validation.py | Adds eval-conc field and multinode_evals to validated changelog matrix output. |
| utils/evals/EVALS.md | Updates documentation to describe separate single-node vs multi-node eval job behavior and env var wiring. |
| utils/bench_serving/KNOWN_LIMITATION.md | Adds a known limitation note for bench_serving client behavior at ultra-high QPS. |
| benchmarks/multi_node/amd_utils/submit.sh | Threads eval-related env vars into the AMD multi-node submission environment. |
| benchmarks/multi_node/amd_utils/job.slurm | Threads eval-related env vars into the Docker container environment. |
| benchmarks/multi_node/amd_utils/server.sh | Adds eval-only skip for throughput + runs lm-eval in multi-node flow when requested. |
| .github/workflows/run-sweep.yml | Adds sweep-multi-node-evals job and updates collect-evals dependencies/condition. |
| .github/workflows/profile.yml | Simplifies Slurm cleanup to always scancel by runner name when Slurm is present. |
| .github/workflows/benchmark-tmpl.yml | Simplifies Slurm cleanup to always scancel by runner name; adds pre-run cleanup of stale eval outputs. |
| .github/workflows/pr-recipe-reminder.yml | Adjusts reminder comment content to include additional guidance about rerunning actions / support. |
| .github/workflows/claude-pr-review.yml | Adds guidance about perf-changelog.yaml append-only chronological ordering. |
| .github/PULL_REQUEST_TEMPLATE/pull_request_template.md | Adds a checkbox to remind authors to append perf-changelog entries to the end. |
| runners/launch_mi325x-amds.sh | Adds a Slurm+Enroot-based runner launcher script for MI325X. |
| runners/launch_b200-dgxc.sh | Removes a legacy runner launcher script. |
| benchmarks/single_node/* | Adds multiple new single-node benchmark recipes (Qwen3.5, GLM5/5.1, MiniMax, Kimi, DSR1, etc.) and tweaks a couple existing ones. |
| docs/lmcache_nvme_recipe.md | Adds an operator recipe for LMCache NVMe cold-tier configuration via LMCACHE_EXTRA_CONFIG_FILE. |
| docs/kv_offload_readme.md | Adds a KV offload operator readme describing sweep/probe/recipe surfaces. |
| datasets/isb1/** | Adds/updates ISB1 dataset manifests and many Git LFS pointer files plus datasets/isb1/.gitattributes for LFS rules. |
| .gitattributes | Adds Git LFS rules for ISB1 export JSONs. |
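The `utils/process_changelog.py` split described in the table above can be sketched roughly as follows. This is a minimal illustration that assumes entries are plain dicts and that multi-node entries are recognized by the presence of a `prefill` key; the real script's structure may differ:

```python
def split_eval_results(entries):
    """Partition changelog eval entries into single-node vs multi-node.

    Entries carrying a `prefill` key are treated as multi-node
    (prefill/decode disaggregated) and routed to `multinode_evals`;
    everything else stays in `evals`.
    """
    result = {"evals": [], "multinode_evals": []}
    for entry in entries:
        bucket = "multinode_evals" if "prefill" in entry else "evals"
        result[bucket].append(entry)
    return result
```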
Summary
Opt-in, additive KV-cache offload surface for SemiAnalysisAI#993. Granular `--cpu-offload-gb` sweep, live offload probe, LMCache NVMe recipe, curated pressure subset, operator playbook.

Scope

Stacks on: SemiAnalysisAI#1032. No `experimental/**` touches. No edits to Cam's `*_lmcache_aiperf.sh`. This is an opt-in, additive extension of the KV-cache-offloading surface introduced in SemiAnalysisAI#993. No harness edits are required: operators who run Cam's existing `multiturn_fp8_h200_trace_replay.sh` or `multiturn_fp8_h100_lmcache_aiperf.sh` get these knobs by passing a different sweep config and, optionally, a parallel probe script. Zero changes under `experimental/**`. Cherry-pickable onto upstream because the change is config/docs/tooling only.
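To illustrate how a granular `--cpu-offload-gb` sweep could be expanded into per-run server arguments: the config keys and helper below are hypothetical (not the repo's actual sweep machinery), though `--cpu-offload-gb` itself is a real vLLM engine flag:

```python
def expand_offload_sweep(sweep_cfg):
    """Expand a sweep config into one vLLM arg list per offload point.

    `sweep_cfg` is a hypothetical dict shape, e.g. loaded from the sweep
    YAML: {"base_args": ["--model", "..."], "cpu_offload_gb": [0, 40, 80]}
    """
    base = list(sweep_cfg.get("base_args", []))
    points = sweep_cfg.get("cpu_offload_gb", [0])
    # One server invocation per offload size; 0 GB is the no-offload baseline.
    return [base + ["--cpu-offload-gb", str(gb)] for gb in points]
```

Because the sweep only varies one flag, every run shares identical base arguments, which keeps offload size the sole independent variable in the comparison.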
What upstream gets (if cherry-picked)
- `--cpu-offload-gb` sweep config
- `kv_offload_probe.py` side-car for vLLM `/metrics`

Verification
- `python tools/validate_kvcache_tester_trace.py datasets/isb1/converted/ --pressure-manifest datasets/isb1/kv_pressure/manifest.json` → clean
- `/opt/homebrew/opt/python@3.13/bin/python3.13 -m unittest tools.test_kv_offload_probe -v` → clean
- `/opt/homebrew/opt/python@3.13/bin/python3.13 -c "import yaml; yaml.safe_load(open('.github/configs/multiturn-agentic-trace-isb1-offload-sweep.yaml'))"` → OK
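The live offload probe amounts to a loop that periodically scrapes vLLM's Prometheus-format `/metrics` endpoint. The parsing helper below is a minimal illustration; the metric name shown is an assumption about what such a probe would watch, and the real `kv_offload_probe.py` may differ:

```python
def read_gauge(metrics_text, name):
    """Pull a single gauge value out of Prometheus text-format output.

    Returns None if the metric is absent or its value is unparsable.
    """
    for line in metrics_text.splitlines():
        if line.startswith(name):
            try:
                return float(line.rsplit(" ", 1)[-1])
            except ValueError:
                return None
    return None

# A probe loop would GET http://<server>/metrics on an interval and log
# the sampled gauge, e.g. for a line like:
sample = "vllm:gpu_cache_usage_perc 0.42"
usage = read_gauge(sample, "vllm:gpu_cache_usage_perc")
```

Running the probe as a side-car keeps it out of the benchmark's critical path: the scrape only reads the metrics endpoint and never touches the request stream being measured.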