
Kda mixer #395

Merged
oleksost merged 51 commits into main from kda on Dec 8, 2025

Conversation

@oleksost
Contributor

@oleksost oleksost commented Nov 26, 2025

✨ Description

Should be merged after GDN (#392).

Adds the KDA mixer from Kimi Linear.

Note: for now this requires nightly Triton and PyTorch; see https://github.com/fla-org/flash-linear-attention/blob/main/FAQs.md.

Merged #404 into this branch. Tests for both hybrid_kda and apriel2_text_gdn_hybrid models pass when using the new Docker image on the toolkit.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@oleksost oleksost marked this pull request as ready for review November 26, 2025 20:40
Collaborator

@jlamypoirier jlamypoirier left a comment

Some comments, most also apply to GDA

Comment thread Dockerfile Outdated
# The image is still compatible with any user id.
RUN useradd user
USER user
USER user No newline at end of file
Collaborator

Unnecessary diff

super()._validate()


@config_class(dynamic_type={MixerConfig: "kda"})
Collaborator

"kimi_delta_attention"

desc="Configuration for the gated normalization applied to the KDA output.",
hint=FieldHint.architecture,
)
q_projection_layer: AffineLinearConfig = Field(
Collaborator

projection seems unnecessary in these fields.

Comment thread fast_llm/layers/ssm/config.py Outdated
)

@property
def layer_class(self) -> "type":
Collaborator

type["KimiDeltaAttention"]

return KimiDeltaAttention

def _validate(self) -> None:
with self._set_implicit_default():
Collaborator

Not sure that's a good idea; it makes configs hard to understand. Better to require the user to specify these explicitly. (And most of the time we're creating from HF, so that's not a problem.)

Comment thread tests/layers/test_kda_equivalence.py Outdated


@pytest.mark.slow
@pytest.mark.skipif(not torch.cuda.is_available(), reason="KDA equivalence test needs CUDA")
Collaborator

pytest.mark.requires_cuda
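For reference, a project-specific marker like `requires_cuda` is usually backed by a `conftest.py` hook; a minimal sketch of the idea (hypothetical, not the actual Fast-LLM conftest):

```python
# Hypothetical conftest.py sketch: turn a `requires_cuda` marker into a skip
# on machines without a GPU. The real Fast-LLM conftest may differ.
import pytest


def _cuda_available() -> bool:
    try:
        import torch

        return torch.cuda.is_available()
    except ImportError:
        return False


def should_skip(has_requires_cuda_marker: bool, cuda_available: bool) -> bool:
    # Skip exactly the tests that ask for CUDA on machines without it.
    return has_requires_cuda_marker and not cuda_available


def pytest_collection_modifyitems(config, items):
    cuda_available = _cuda_available()
    skip_cuda = pytest.mark.skip(reason="test requires CUDA")
    for item in items:
        if should_skip("requires_cuda" in item.keywords, cuda_available):
            item.add_marker(skip_cuda)
```

This keeps the per-test skip condition in one place instead of repeating the `torch.cuda.is_available()` check on every test.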

Comment thread tests/layers/test_kda_equivalence.py Outdated
AprielHybridSSMConfig, KimiDeltaAttention = None, None


def _materialize_mixer_tensors(module: torch.nn.Module, distributed: Distributed, device: torch.device) -> None:
Collaborator

Please use get_stage; it already does this. See the example here: https://github.com/ServiceNow/Fast-LLM/blob/main/tests/layers/test_lm_head.py#L264

Also, please don't copy utils into every file; they can go in utils.

@pytest.mark.skipif(not torch.cuda.is_available(), reason="KDA equivalence test needs CUDA")
@pytest.mark.skipif(KimiDeltaAttention is None or AprielHybridSSMConfig is None, reason="Apriel KDA deps missing")
@pytest.mark.skipif(kda_module.chunk_kda is None, reason="KDA fused kernels not available")
def test_fast_llm_kda_matches_apriel_forward():
Collaborator

Not sure we need this test at all. test_huggingface_model already tests the equivalence

Contributor Author

I agree that we won't need these eventually.

test_huggingface_model seems to be a heavier integration test compared to the isolated unit tests test_kda_equivalence and test_gda_equivalence. The latter are more useful for development.

Can we keep them for some time, until the GDN and KDA implementations are production-tested?

Collaborator

I believe these tests are extremely valuable and should remain.
In fact, I think we should extend them to non-FLA backup implementations of GDN and KDA.

ModelTestingGroup.convert: ModelTestingGroupAction.normal,
ModelTestingGroup.generate: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.megatron: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.distributed: ModelTestingGroupAction.normal,
Collaborator

We might want to test once and then leave it as unimportant; this has a huge impact on testing time.

Contributor Author

Leaving them here for now, until we’ve used KDA and GDN enough to be confident they’re stable and free of issues.

Update to nvcr.io/nvidia/pytorch:25.11-py3 which includes:
- PyTorch 2.10
- CUDA 13.0
- flash-attn 2.7.4.post1 (pre-installed, no compilation needed)

Dependency updates:
- causal-conv1d: v1.5.4 (was pinned to commit 2a288a1)
- mamba-ssm: 2.2.6.post3 (was pinned to commit 4a8a2a2)
- flash-linear-attention: pin to commit 67eee20 (was @main)
- flash-attn: 2.7.4.post1 to match base image (was 2.7.3)
- triton: 3.5.1 in Dockerfile (was 3.1.0)

These updates enable Kimi Delta Attention (KDA) support via the
flash-linear-attention library. The pinned versions are tested and
working, unlike the nightly/unpinned approach in #395.

Note: Dropless MoE kernel remains broken with triton >= 3.2.0 and
needs a complete rewrite (also limited to 32 experts). This is
tracked separately and doesn't block KDA work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
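Expressed as pip pins, the dependency combination above might look like this (illustrative only; the actual Dockerfile lines are not shown in this excerpt, and flash-attn comes pre-installed in the base image):

```shell
# Hypothetical pin set matching the versions listed in the commit message.
pip install "causal-conv1d==1.5.4" "mamba-ssm==2.2.6.post3" "triton==3.5.1" \
    "git+https://github.com/fla-org/flash-linear-attention.git@67eee20"
```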
@oleksost oleksost requested a review from jlamypoirier December 8, 2025 16:29
Collaborator

@tscholak tscholak left a comment

very good work, thank you!

I'd like us to have a non-FLA fallback for GDN and KDA, similar to our torch-compiled attention backup implementation.
You should be able to reuse much of the GDN torch code from upstream Qwen3Next, and similarly from Kimi Linear.
And then we add a config option for GDN and KDA like so:

class AttentionImplementation(enum.StrEnum):
    auto = "auto"
    flash = "flash"
    backup = "backup"
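For context, dispatch on such a config option could look like the following sketch (the enum values are from the comment above; `resolve_implementation` and its fallback logic are hypothetical, not code from this PR):

```python
import enum


class AttentionImplementation(str, enum.Enum):
    # str mixin instead of enum.StrEnum so the sketch also runs on Python < 3.11
    auto = "auto"
    flash = "flash"
    backup = "backup"


def resolve_implementation(
    requested: AttentionImplementation, fla_available: bool
) -> AttentionImplementation:
    # Hypothetical resolution logic: "auto" picks the fused FLA kernels when
    # they are importable and falls back to the torch-only path otherwise.
    if requested is AttentionImplementation.auto:
        return AttentionImplementation.flash if fla_available else AttentionImplementation.backup
    if requested is AttentionImplementation.flash and not fla_available:
        raise RuntimeError("flash implementation requested but fla is not installed")
    return requested
```

With this shape, explicitly requesting `flash` on a machine without FLA fails loudly instead of silently switching implementations.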

# same as rearrange(v, '... (h d) -> ... h d', d=self.head_dim)
return tensor.view(tensor.shape[0], tensor.shape[1], self._local_heads, self._config.head_dim)

def _forward(
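The view/rearrange equivalence claimed in the comment above can be checked with a small numpy sketch (shapes are made up for illustration; the real code operates on torch tensors):

```python
import numpy as np

# Hypothetical shapes: batch, sequence, heads, per-head dimension.
batch, seq, heads, head_dim = 2, 3, 4, 5
x = np.arange(batch * seq * heads * head_dim, dtype=np.float32).reshape(
    batch, seq, heads * head_dim
)

# view-style split of the flat head axis: '... (h d) -> ... h d'
split = x.reshape(batch, seq, heads, head_dim)

# Head h occupies the contiguous slice [h * head_dim, (h + 1) * head_dim)
# of the flat axis, which is exactly what rearrange with d=head_dim does.
assert split.shape == (batch, seq, heads, head_dim)
assert np.array_equal(split[:, :, 1, :], x[:, :, head_dim : 2 * head_dim])
```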
Collaborator

can we please have a torch-only compiled fallback in case fla isn't available?

Contributor Author

Added these as todos in #406.

fast_layer.preprocess(fast_kwargs)
fast_out, _ = fast_layer(hidden_states, fast_kwargs)

torch.testing.assert_close(fast_out, hf_out, atol=1e-5, rtol=1e-5)
Collaborator

can we please structure this test like:

def test_attention_implementations(cross_document_attention: bool, causal: bool, window_size: int | None):

We should add a backup implementation for gdn in case fla isn't available

fast_layer.preprocess(fast_kwargs)
fast_out, _ = fast_layer(hidden_states, fast_kwargs)

torch.testing.assert_close(fast_out, hf_out, atol=1e-5, rtol=1e-5)
Collaborator

can we please structure this test like:

def test_attention_implementations(cross_document_attention: bool, causal: bool, window_size: int | None):

We should add a backup implementation for kda in case fla isn't available

Comment thread tests/layers/test_kda_equivalence.py Outdated
torch.testing.assert_close(fast_out, hf_out, atol=1e-5, rtol=1e-5)


if __name__ == "__main__":
Collaborator

let's remove that

Comment thread tests/layers/test_gdn_equivalence.py Outdated
Collaborator

let's remove that

@oleksost oleksost merged commit 9d12e9c into main Dec 8, 2025
4 checks passed
@oleksost oleksost deleted the kda branch December 8, 2025 19:45