Add opt-in KV offload sweep, probe, and operator playbook#3
OCWC22 wants to merge 1 commit into isb1/kv-cache-stress-benchmark from
Conversation
Stacks on 38fd91a (PR SemiAnalysisAI#1032), mirrors the opt-in fork framing from b31f7c1 (fork PR #2), and remains a sibling of 992ff21 (GMI runbook).
> Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
Pull request overview
This PR extends the benchmarking/eval surface to support separate multi-node eval jobs and adds operator-facing documentation/configuration and additional benchmark recipes (plus new/updated ISB1 dataset artifacts via Git LFS).
Changes:
- Split eval results into single-node vs multi-node (`evals` vs `multinode_evals`) and extend validation to accept `eval-conc` for multi-node entries.
- Add a multi-node eval-only execution path for the AMD multi-node runner scripts and wire a dedicated `sweep-multi-node-evals` job into `run-sweep.yml`.
- Add multiple new single-node benchmark scripts, new runner script(s), docs, and new/updated ISB1 dataset manifests/pointers plus `.gitattributes` rules for LFS.
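As a rough sketch of the validation change, the multi-node `eval-conc` acceptance might look like the helper below. The field names and schema here are assumptions for illustration, not the repo's actual `utils/matrix_logic/validation.py`:

```python
def validate_multinode_eval(entry):
    """Validate one multinode_evals changelog entry (hypothetical schema).

    Multi-node entries are identified by a `prefill` key and may now
    carry an optional `eval-conc` (eval concurrency) field.
    """
    if "prefill" not in entry:
        raise ValueError("multi-node eval entries must specify `prefill`")
    if "eval-conc" in entry and not isinstance(entry["eval-conc"], int):
        raise ValueError("`eval-conc` must be an integer concurrency value")
    return entry
```

The key design point is that `eval-conc` is optional: entries without it still validate, so existing changelog rows are unaffected.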
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| utils/process_changelog.py | Splits eval results into evals and multinode_evals based on presence of prefill. |
| utils/matrix_logic/validation.py | Adds eval-conc field and multinode_evals to validated changelog matrix output. |
| utils/evals/EVALS.md | Updates documentation to describe separate single-node vs multi-node eval job behavior and env var wiring. |
| utils/bench_serving/KNOWN_LIMITATION.md | Adds a known limitation note for bench_serving client behavior at ultra-high QPS. |
| benchmarks/multi_node/amd_utils/submit.sh | Threads eval-related env vars into the AMD multi-node submission environment. |
| benchmarks/multi_node/amd_utils/job.slurm | Threads eval-related env vars into the Docker container environment. |
| benchmarks/multi_node/amd_utils/server.sh | Adds eval-only skip for throughput + runs lm-eval in multi-node flow when requested. |
| .github/workflows/run-sweep.yml | Adds sweep-multi-node-evals job and updates collect-evals dependencies/condition. |
| .github/workflows/profile.yml | Simplifies Slurm cleanup to always scancel by runner name when Slurm is present. |
| .github/workflows/benchmark-tmpl.yml | Simplifies Slurm cleanup to always scancel by runner name; adds pre-run cleanup of stale eval outputs. |
| .github/workflows/pr-recipe-reminder.yml | Adjusts reminder comment content to include additional guidance about rerunning actions / support. |
| .github/workflows/claude-pr-review.yml | Adds guidance about perf-changelog.yaml append-only chronological ordering. |
| .github/PULL_REQUEST_TEMPLATE/pull_request_template.md | Adds a checkbox to remind authors to append perf-changelog entries to the end. |
| runners/launch_mi325x-amds.sh | Adds a Slurm+Enroot-based runner launcher script for MI325X. |
| runners/launch_b200-dgxc.sh | Removes a legacy runner launcher script. |
| benchmarks/single_node/* | Adds multiple new single-node benchmark recipes (Qwen3.5, GLM5/5.1, MiniMax, Kimi, DSR1, etc.) and tweaks a couple existing ones. |
| docs/lmcache_nvme_recipe.md | Adds an operator recipe for LMCache NVMe cold-tier configuration via LMCACHE_EXTRA_CONFIG_FILE. |
| docs/kv_offload_readme.md | Adds a KV offload operator readme describing sweep/probe/recipe surfaces. |
| datasets/isb1/** | Adds/updates ISB1 dataset manifests and many Git LFS pointer files plus datasets/isb1/.gitattributes for LFS rules. |
| .gitattributes | Adds Git LFS rules for ISB1 export JSONs. |
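The `utils/process_changelog.py` split described in the table above can be sketched roughly as follows. This is a minimal illustration that assumes entries are plain dicts and that multi-node entries are recognized by the presence of a `prefill` key; the real script's structure may differ:

```python
def split_eval_results(entries):
    """Partition changelog eval entries into single-node vs multi-node.

    Entries carrying a `prefill` key are treated as multi-node
    (prefill/decode disaggregated) and routed to `multinode_evals`;
    everything else stays in `evals`.
    """
    result = {"evals": [], "multinode_evals": []}
    for entry in entries:
        bucket = "multinode_evals" if "prefill" in entry else "evals"
        result[bucket].append(entry)
    return result
```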
Summary
Opt-in, additive KV-cache offload surface for SemiAnalysisAI#993. Granular `--cpu-offload-gb` sweep, live offload probe, LMCache NVMe recipe, curated pressure subset, operator playbook.

Scope

Stacks on: SemiAnalysisAI#1032. No `experimental/**` touches. No edits to Cam's `*_lmcache_aiperf.sh`. This is an opt-in, additive extension of the KV-cache-offloading surface introduced in SemiAnalysisAI#993. No harness edits are required: operators who run Cam's existing `multiturn_fp8_h200_trace_replay.sh` or `multiturn_fp8_h100_lmcache_aiperf.sh` get these knobs by passing a different sweep config and, optionally, a parallel probe script. Zero changes under `experimental/**`. Cherry-pickable onto upstream because the change is config/docs/tooling only.
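To illustrate how a granular `--cpu-offload-gb` sweep could be expanded into per-run server arguments: the config keys and helper below are hypothetical (not the repo's actual sweep machinery), though `--cpu-offload-gb` itself is a real vLLM engine flag:

```python
def expand_offload_sweep(sweep_cfg):
    """Expand a sweep config into one vLLM arg list per offload point.

    `sweep_cfg` is a hypothetical dict shape, e.g. loaded from the sweep
    YAML: {"base_args": ["--model", "..."], "cpu_offload_gb": [0, 40, 80]}
    """
    base = list(sweep_cfg.get("base_args", []))
    points = sweep_cfg.get("cpu_offload_gb", [0])
    # One server invocation per offload size; 0 GB is the no-offload baseline.
    return [base + ["--cpu-offload-gb", str(gb)] for gb in points]
```

Because the sweep only varies one flag, every run shares identical base arguments, which keeps offload size the sole independent variable in the comparison.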
What upstream gets (if cherry-picked)
- `--cpu-offload-gb` sweep config
- `kv_offload_probe.py` side-car for vLLM `/metrics`

Verification
- `python tools/validate_kvcache_tester_trace.py datasets/isb1/converted/ --pressure-manifest datasets/isb1/kv_pressure/manifest.json` → clean
- `/opt/homebrew/opt/python@3.13/bin/python3.13 -m unittest tools.test_kv_offload_probe -v` → clean
- `/opt/homebrew/opt/python@3.13/bin/python3.13 -c "import yaml; yaml.safe_load(open('.github/configs/multiturn-agentic-trace-isb1-offload-sweep.yaml'))"` → OK
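The live offload probe amounts to a loop that periodically scrapes vLLM's Prometheus-format `/metrics` endpoint. The parsing helper below is a minimal illustration; the metric name shown is an assumption about what such a probe would watch, and the real `kv_offload_probe.py` may differ:

```python
def read_gauge(metrics_text, name):
    """Pull a single gauge value out of Prometheus text-format output.

    Returns None if the metric is absent or its value is unparsable.
    """
    for line in metrics_text.splitlines():
        if line.startswith(name):
            try:
                return float(line.rsplit(" ", 1)[-1])
            except ValueError:
                return None
    return None

# A probe loop would GET http://<server>/metrics on an interval and log
# the sampled gauge, e.g. for a line like:
sample = "vllm:gpu_cache_usage_perc 0.42"
usage = read_gauge(sample, "vllm:gpu_cache_usage_perc")
```

Running the probe as a side-car keeps it out of the benchmark's critical path: the scrape only reads the metrics endpoint and never touches the request stream being measured.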