Fix missing ReLU in GLM-MOE-DSA indexer scoring#44690

Closed
gambletan wants to merge 1 commit into huggingface:main from gambletan:fix/glm-moe-dsa-relu

Conversation

@gambletan

Summary

Fixes #44360

The GlmMoeDsaIndexer is missing a ReLU activation on the per-head dot-product scores before the weighted sum across heads. The reference DeepSeek V3.2 implementation applies ReLU inside the fp8_index kernel:

# Reference: inference/kernel.py – fp8_index_kernel
logits[i3_n, i_h] = T.max(logits[i3_n, i_h], 0) * q_s_frag[i_h]

The computation flow in the kernel is:

  1. logits = q @ k^T (per-head dot products)
  2. logits = relu(logits) * weights (ReLU, then multiply by head weights)
  3. index_score = sum_h(logits) (reduce across heads)

The HF bf16 equivalent was missing step 2's ReLU. Without it, negative attention scores incorrectly contribute to index scoring, which can affect top-k token selection for sparse attention.
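The three-step flow above can be sketched in numpy (shapes and variable names are illustrative, not the actual HF implementation):

```python
import numpy as np

# Illustrative shapes: batch=1, queries=2, heads=4, head_dim=8, cached keys=5
b, s, h, d, t = 1, 2, 4, 8, 5
rng = np.random.default_rng(0)
q = rng.standard_normal((b, s, h, d))
k_cached = rng.standard_normal((b, t, d))
weights = rng.standard_normal((b, s, h))
softmax_scale = d ** -0.5

# Step 1: per-head dot products, q @ k^T
logits = np.einsum("bshd,btd->bsht", q, k_cached) * softmax_scale
# Step 2: ReLU before weighting -- the step the HF port was missing
logits = np.maximum(logits, 0.0)
# Step 3: weight each head and reduce across heads to one score per (query, key)
index_scores = np.einsum("bsht,bsh->bst", logits, weights)
print(index_scores.shape)  # (1, 2, 5)
```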

Change

Added torch.nn.functional.relu(scores) after the per-head q·k^T computation and before the weighted sum, in both modular_glm_moe_dsa.py and modeling_glm_moe_dsa.py:

  scores = torch.einsum("bshd,btd->bsht", q.float(), k_cached.float()) * self.softmax_scale
+
+ # ReLU matches the reference fp8_index kernel: T.max(logits, 0) before weighting
+ scores = torch.nn.functional.relu(scores)
+
  index_scores = torch.einsum("bsht,bsh->bst", scores, weights)
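A tiny numerical example (hypothetical values) shows how the missing ReLU can flip the top-k selection: one strongly negative head score drags a key's total below other candidates, even though its positive head score should make it a strong match.

```python
import numpy as np

# Hypothetical per-head scores for one query against 3 candidate keys, 2 heads.
scores = np.array([[2.0, -5.0],   # key 0: one strongly negative head
                   [1.0,  1.0],   # key 1
                   [0.5,  0.5]])  # key 2
weights = np.array([1.0, 1.0])    # positive head weights for simplicity

without_relu = scores @ weights               # [-3.0, 2.0, 1.0]
with_relu = np.maximum(scores, 0) @ weights   # [ 2.0, 2.0, 1.0]

k = 2
top_without = set(np.argsort(without_relu)[-k:])  # key 0 is dropped
top_with = set(np.argsort(with_relu)[-k:])        # key 0 is selected
```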

Test plan

  • Verify that, with this fix, model outputs match the reference DeepSeek V3.2 implementation more closely
  • Run existing GLM-MOE-DSA tests to confirm no regressions

The DSA indexer was missing a ReLU activation on the per-head
dot-product scores before the weighted sum across heads. The reference
DeepSeek V3.2 implementation applies ReLU inside the fp8_index kernel
via `T.max(logits, 0)` before multiplying by head weights. Without
this, negative attention scores incorrectly contribute to the index
scoring, which can affect top-k token selection for sparse attention.

Fixes huggingface#44360

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm_moe_dsa

@Rocketknight1
Member

Running a code agent to spam our notifications with a redundant PR when there's a maintainer PR already at #44564 is a good way to get blocked - be careful!



Development

Successfully merging this pull request may close these issues.

[Bug/Discussion] The DSA indexer lacks a ReLU
