
Pass packed boundary metadata to Qwen3.5 linear-attention fast kernels#44867

Closed
sdharani91 wants to merge 1 commit into huggingface:main from sdharani91:feature_packing_qwen

Conversation

@sdharani91

@sdharani91 sdharani91 commented Mar 19, 2026

What does this PR do?

Fixes #44717

This PR fixes packed-sequence handling for the Qwen3.5 linear-attention fast path.

Before this change, Qwen3.5 produced different outputs for:

- a padded representation of multiple sequences
- a packed representation of the same sequences with reset position_ids
The issue was specific to the linear-attention fast path. Full-attention layers already respected packed boundaries through the shared masking logic, but the Qwen3.5 fast linear-attention path was not passing packed-boundary metadata into its kernels.

This PR fixes that by:

- deriving packed-boundary metadata from the packed position_ids
- passing seq_idx to the causal-convolution fast path
- passing cu_seqlens to the FLA gated-delta-rule fast path

The change is intentionally scoped to the Qwen3.5 fast path for packed prefill inputs. The slow fallback path is not changed in this PR.
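The derivation step above can be sketched as follows. This is a minimal illustration, not the PR's actual helper: `get_packed_metadata` is a hypothetical name, and it assumes the packed batch has shape `(1, total_len)` with position_ids that reset to 0 at each segment start.

```python
# Hedged sketch: derive packed-boundary metadata from reset position_ids.
# `get_packed_metadata` is an illustrative name, not the helper in the PR.
import torch

def get_packed_metadata(position_ids: torch.Tensor):
    """For a packed batch of shape (1, total_len) whose position_ids reset
    to 0 at each segment start, return (cu_seqlens, seq_idx)."""
    pos = position_ids[0]                                 # (total_len,)
    # A new segment starts wherever the position counter resets to 0.
    starts = torch.nonzero(pos == 0, as_tuple=False).flatten()
    total_len = torch.tensor([pos.numel()], device=pos.device)
    cu_seqlens = torch.cat([starts, total_len]).to(torch.int32)
    # seq_idx labels each token with its segment index, shape (1, total_len).
    seq_idx = torch.cumsum((pos == 0).to(torch.int32), dim=0) - 1
    return cu_seqlens, seq_idx.unsqueeze(0)

pos = torch.tensor([[0, 1, 2, 0, 1, 0, 1, 2, 3]])
cu, idx = get_packed_metadata(pos)
# cu  -> tensor([0, 3, 5, 9], dtype=torch.int32)
# idx -> tensor([[0, 0, 0, 1, 1, 2, 2, 2, 2]], dtype=torch.int32)
```

In this shape, `cu_seqlens` is the cumulative-boundary form expected by varlen-style kernels, and `seq_idx` is the per-token segment label used by the convolution path.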

How was this tested?

Manual validation:

- Reproduced the bug before the fix on Qwen3.5 using a tiny local config with one full-attention layer and one linear-attention layer.
- Compared:
  - padded inputs for multiple sequences
  - packed inputs for the same sequences with reset position_ids
- Before the fix on the fast path:
  - allclose: False
  - max abs diff was about 8e-3
- After the fix on the fast path:
  - the original 2-segment packed-vs-padded repro matches
  - a multi-segment packed-vs-padded repro also matches, with max abs diff around 6e-8
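The padded-vs-packed comparison above can be sketched like this. The input construction is concrete; the model calls are left as comments because they assume a hypothetical tiny Qwen3.5-style checkpoint not specified here.

```python
# Hedged sketch of the packed-vs-padded equivalence check described above.
import torch

seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]

# Padded: one row per sequence, right-padded (pad_id = 0 is an assumption).
pad_id = 0
max_len = max(s.numel() for s in seqs)
padded = torch.full((len(seqs), max_len), pad_id)
for i, s in enumerate(seqs):
    padded[i, : s.numel()] = s
attn_mask = (padded != pad_id).long()

# Packed: one row, sequences concatenated, position_ids reset per segment.
packed = torch.cat(seqs).unsqueeze(0)
position_ids = torch.cat(
    [torch.arange(s.numel()) for s in seqs]
).unsqueeze(0)

# With a model in hand (not shown), the repro compares per-segment logits:
# out_padded = model(padded, attention_mask=attn_mask).logits
# out_packed = model(packed, position_ids=position_ids).logits
# torch.allclose(out_packed[0, :3], out_padded[0, :3], atol=1e-6)
```

Before the fix, the linear-attention fast path let state leak across the segment boundary in the packed row, which is what the ~8e-3 max abs diff reflects.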
Sanity checks:

- Verified Qwen3.5 was using the fast kernels:
  - causal_conv1d_fn present: True
  - fla.ops.gated_delta_rule.chunk
  - fla.ops.gated_delta_rule.fused_recurrent
- Verified a normal unpacked Qwen3.5 forward still works after the change.

Unit tests:

Added tests for the packed-metadata helper in tests/models/qwen3_5/test_modeling_qwen3_5.py, including:

- simple packed input
- multi-segment packed input
- cases where packed metadata should be skipped, such as cached inputs or unsupported batch layouts

Who can review?

@vasqu

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_5


Development

Successfully merging this pull request may close these issues.

Support packed sequences for linear attention models (i.e. Qwen3.5)