
Discount sliced attention in CUDA override peak estimate #29

Merged
cryptopoly merged 1 commit into main from fix/hunyuan-nf4-attention-discount on May 2, 2026

Conversation

@cryptopoly (Owner)

Summary

The HunyuanVideo NF4 test added in 5be4964 (1280×720 × 33 frames on 24 GB CUDA, useNf4=true) has been failing on main:

src/utils/__tests__/videos.test.ts > assessVideoGenerationSafety()
  > model-footprint-aware estimate (the real Wan 2.1 crash case)
  > accounts for NF4 on HunyuanVideo CUDA runs
AssertionError: expected 'danger' not to be 'danger'

estimateVideoRequestPeakGb was double-counting attention:

modelFootprint = 22.0 GB     (NF4 override)
attentionPeak  = 15.6 GB     (32400 tokens^2 * 2 bytes * 8 / 1024^3)
estimatedPeak  = max(22, 22*0.55 + 15.6) = 27.7 GB
budget         = 24 * 0.95 = 22.8 GB
ratio          = 1.21 -> danger
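The double-counting can be reproduced with a small sketch. This is an illustrative reconstruction from the numbers above, not the actual implementation; the function names and the resident-weights constant name are assumptions.

```typescript
// Hypothetical reconstruction of the pre-fix math. The 0.55 and 8x
// values come from the breakdown above; names are illustrative.
const RESIDENT_WEIGHTS_FACTOR = 0.55; // fraction of weights resident at peak
const EFFECTIVE_HEAD_SLAB_MULTIPLIER = 8; // dense fp16 attention slabs

function attentionPeakGb(tokens: number): number {
  // tokens^2 * 2 bytes (fp16) * slab multiplier, converted to GiB
  return (tokens ** 2 * 2 * EFFECTIVE_HEAD_SLAB_MULTIPLIER) / 1024 ** 3;
}

function estimatePeakGbOld(modelFootprintGb: number, tokens: number): number {
  // Pre-fix: the full attention peak is added on top of resident weights
  return Math.max(
    modelFootprintGb,
    modelFootprintGb * RESIDENT_WEIGHTS_FACTOR + attentionPeakGb(tokens),
  );
}

const peakGb = estimatePeakGbOld(22.0, 32400); // ≈ 27.7 GB
const ratio = peakGb / (24 * 0.95);            // ≈ 1.21 -> danger
```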

Real HunyuanVideo NF4 runs on a 4090 fit inside 24 GB via attention slicing, fp8 KV caching, and sequence-parallel kernels; the dense fp16 8× slab assumed by EFFECTIVE_HEAD_SLAB_MULTIPLIER overestimates resident attention by roughly 40% in those configurations.

Fix

Add CUDA_OVERRIDE_ATTENTION_DISCOUNT = 0.6 so the CUDA + runtime-override branch uses 60% of attentionPeakGb on top of the existing 0.55× resident-weights factor.

After:

estimatedPeak  = max(22, 12.1 + 15.6*0.6) = max(22, 21.5) = 22 GB
ratio          = 0.96 -> caution (< dangerRatio 1.0)
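A minimal sketch of the fixed branch, under the same assumptions as above: only CUDA_OVERRIDE_ATTENTION_DISCOUNT is named in this PR, and the signature and branch flags are illustrative.

```typescript
// Sketch of the fixed estimate. CUDA_OVERRIDE_ATTENTION_DISCOUNT is the
// constant this PR adds; everything else is a hypothetical reconstruction.
const CUDA_OVERRIDE_ATTENTION_DISCOUNT = 0.6;
const RESIDENT_WEIGHTS_FACTOR = 0.55;

function estimatePeakGb(
  modelFootprintGb: number,
  attentionPeakGb: number,
  isCuda: boolean,
  hasRuntimeOverride: boolean,
): number {
  // Discount only fires on the CUDA + runtime-override path with a known
  // model footprint; every other path keeps the conservative math.
  const discount =
    isCuda && hasRuntimeOverride && modelFootprintGb > 0
      ? CUDA_OVERRIDE_ATTENTION_DISCOUNT
      : 1;
  return Math.max(
    modelFootprintGb,
    modelFootprintGb * RESIDENT_WEIGHTS_FACTOR + attentionPeakGb * discount,
  );
}

const peakGb = estimatePeakGb(22.0, 15.6, true, true); // max(22, 21.46) = 22
const ratio = peakGb / (24 * 0.95);                    // ≈ 0.96 -> caution
```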

Cross-check

| Test | Config | Old peak | New peak | Verdict |
| --- | --- | --- | --- | --- |
| Wan 2.2 5B NF4 | 832×480 × 33, model 14.5 | 14.5 | 14.5 | safe (0.64) |
| Wan 2.1 14B NF4 | 832×480 × 33, model 18 | 18.0 | 18.0 | caution (0.79) |
| HunyuanVideo NF4 | 1280×720 × 33, model 22 | 27.7 | 22.0 | caution (0.96) |
| Wan 2.2 5B long clip | 832×480 × 96, model 22 | 33.0 | 24.6 | danger (1.08) ✓ |

The discount only fires when CUDA + override + modelFootprint > 0. Attention-only paths (no override) keep the conservative modelFootprint + attention math, so the long-clip danger warning and the existing 4090 832×480 × 96 caution case still hold.
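The verdicts in the cross-check table can be spot-checked from the new peaks alone. Only dangerRatio 1.0 appears in this PR; the safe/caution boundary below is an assumption inferred from the 0.64-vs-0.79 split in the table.

```typescript
// Spot-check of the cross-check verdicts. DANGER_RATIO matches the tests;
// CAUTION_RATIO is an assumed boundary, not a value from this PR.
const budgetGb = 24 * 0.95; // 22.8 GB
const DANGER_RATIO = 1.0;
const CAUTION_RATIO = 0.7; // assumption: safe rows sit below this

function verdict(peakGb: number): "safe" | "caution" | "danger" {
  const ratio = peakGb / budgetGb;
  if (ratio >= DANGER_RATIO) return "danger";
  if (ratio >= CAUTION_RATIO) return "caution";
  return "safe";
}

// New peaks from the cross-check table:
verdict(14.5); // safe    (0.64)  Wan 2.2 5B NF4
verdict(18.0); // caution (0.79)  Wan 2.1 14B NF4
verdict(22.0); // caution (0.96)  HunyuanVideo NF4
verdict(24.6); // danger  (1.08)  Wan 2.2 5B long clip
```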

Test plan

  • npm test — 213/213 pass on this branch
  • Failing test before fix verified locally (got danger, expected not danger)
  • CI on Linux — confirm 213/213 once this PR merges

@cryptopoly cryptopoly merged commit 00a9c02 into main May 2, 2026
1 of 2 checks passed
