Discount sliced attention in CUDA override peak estimate #29
Merged
cryptopoly merged 1 commit into main from May 2, 2026
Conversation
The HunyuanVideo NF4 test added in 5be4964 (1280×720 × 33 frames on 24 GB CUDA, useNf4=true) was failing on main because estimateVideoRequestPeakGb was double-counting the attention term:

```
modelFootprint = 22.0 GB (NF4 override)
attentionPeak  = 32400 tokens^2 * 2 bytes * 8 (EFFECTIVE_HEAD_SLAB_MULTIPLIER) ≈ 15.6 GB
estimatedPeak  = max(22, 22*0.55 + 15.6) = 27.7 GB
budget         = 24 * 0.95 = 22.8 GB
ratio          = 27.7 / 22.8 = 1.21 -> danger
```

But the test asserts 'not danger' because real HunyuanVideo NF4 runs on a 4090 fit inside 24 GB with attention slicing / fp8 KV / sequence-parallel kernels. The dense fp16 slab assumed by EFFECTIVE_HEAD_SLAB_MULTIPLIER overestimates resident attention by roughly 40% in those configurations.

Add CUDA_OVERRIDE_ATTENTION_DISCOUNT (0.6) so the CUDA + runtime-override branch uses 60% of the raw attentionPeakGb on top of the 0.55× resident-weight factor:

```
estimatedPeak = max(22, 12.1 + 15.6*0.6) = max(22, 21.5) = 22 GB
ratio         = 22 / 22.8 = 0.96 -> caution (under dangerRatio 1.0)
```

Cross-check against the rest of the CUDA-override tests:

- Wan 2.2 5B 832×480 × 33 NF4 (model 14.5): safe -> safe PASS
- Wan 2.1 14B 832×480 × 33 NF4 (model 18): caution -> caution PASS
- Wan 2.2 5B 832×480 × 96 (model 22, no NF4): max(22, 12.1 + 12.5) = 24.6, ratio 1.08 -> danger PASS
- HunyuanVideo HD 33 NF4 (model 22): max(22, 12.1 + 9.4) = 22, ratio 0.96 -> caution PASS

The discount only fires when CUDA + override + modelFootprint > 0. Attention-only paths (no override) keep the conservative modelFootprint + attention math, so 4090 + 832×480 × 96 still flags caution and the very-long-clip warn case still flags danger.
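The branch logic described above can be sketched as follows. This is a minimal reconstruction from the numbers in the description, not the repo's code: the real estimateVideoRequestPeakGb takes a richer request shape, and the input field names here are illustrative assumptions. Only the constants (0.55, 0.6, the max() shape, the gating condition) come from the PR text.

```typescript
// Constants quoted in the PR description.
const RESIDENT_WEIGHT_FACTOR = 0.55;          // fraction of weights resident at attention peak
const CUDA_OVERRIDE_ATTENTION_DISCOUNT = 0.6; // new: sliced-attention discount

// Illustrative input shape (assumption; the real request object is richer).
interface PeakInput {
  modelFootprintGb: number;  // e.g. 22.0 for the HunyuanVideo NF4 override
  attentionPeakGb: number;   // e.g. ~15.6 for 1280x720 x 33 frames
  isCuda: boolean;
  hasRuntimeOverride: boolean;
}

function estimatePeakGb(req: PeakInput): number {
  const { modelFootprintGb, attentionPeakGb } = req;
  // The discount only fires on the CUDA + override branch with a known footprint;
  // all other paths keep the undiscounted (conservative) attention term.
  const discount =
    req.isCuda && req.hasRuntimeOverride && modelFootprintGb > 0
      ? CUDA_OVERRIDE_ATTENTION_DISCOUNT
      : 1.0;
  return Math.max(
    modelFootprintGb,
    modelFootprintGb * RESIDENT_WEIGHT_FACTOR + attentionPeakGb * discount
  );
}

// HunyuanVideo NF4 case from the description:
const peak = estimatePeakGb({
  modelFootprintGb: 22.0,
  attentionPeakGb: 15.6,
  isCuda: true,
  hasRuntimeOverride: true,
});
const budget = 24 * 0.95; // 22.8 GB
console.log(peak.toFixed(1), (peak / budget).toFixed(2)); // prints "22.0 0.96"
```

With the override flag off, the same inputs reproduce the old 27.7 GB estimate, which is why only the override branch needed the discount.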
Summary
The HunyuanVideo NF4 test added in 5be4964 (1280×720 × 33 frames on 24 GB CUDA, useNf4=true) has been failing on main: estimateVideoRequestPeakGb was double-counting attention, estimating 27.7 GB against the 22.8 GB budget (ratio 1.21 -> danger).

Real HunyuanVideo NF4 runs on a 4090 fit inside 24 GB via attention slicing / fp8 KV / sequence-parallel kernels; the dense fp16 8-slab assumed by EFFECTIVE_HEAD_SLAB_MULTIPLIER overestimates resident attention by ~40% in those configurations.

Fix

Add CUDA_OVERRIDE_ATTENTION_DISCOUNT = 0.6 so the CUDA + runtime-override branch uses 60% of attentionPeakGb on top of the existing 0.55× resident-weights factor. After: estimatedPeak = max(22, 12.1 + 9.4) = 22 GB, ratio 0.96 -> caution.
Cross-check
The discount only fires when CUDA + override + modelFootprint > 0. Attention-only paths (no override) keep the conservative modelFootprint + attention math, so the long-clip danger warning and the existing 4090 832×480 × 96 caution case still hold.

Test plan

npm test: 213/213 pass on this branch