[tx] DeepseekV3 implementation #889

tanmaysachan · 2026-01-16T08:44:08Z

Addresses #865

gemini-code-assist

Code Review

This pull request introduces the JAX implementation for the DeepseekV3 model. The implementation is comprehensive and covers the model's unique features like Multi-Head Latent Attention and Mixture of Experts with shared experts. The code is well-structured.

My review focuses on a critical bug that will prevent the model from running, along with some suggestions to improve maintainability by reducing code duplication and avoiding magic numbers. Addressing these points will make the implementation more robust and easier to maintain.

gemini-code-assist · 2026-01-16T08:47:39Z

skyrl-tx/tx/models/deepseekv3.py

+        # Precompute RoPE frequencies
+        # qk_rope_head_dim = config.qk_rope_head_dim
+        # original_seq_len = getattr(config, "original_seq_len", config.max_position_embeddings)
+        # rope_factor = getattr(config, "rope_factor", 1.0)
+        # beta_fast = getattr(config, "beta_fast", 32)
+        # beta_slow = getattr(config, "beta_slow", 1)
+
+        # TODO: Swap out like llama's rope?
+        # self.freqs_cis = precompute_freqs_cis(
+        #     dim=qk_rope_head_dim,
+        #     max_seq_len=config.max_position_embeddings,
+        #     rope_theta=config.rope_theta,
+        #     original_seq_len=original_seq_len,
+        #     rope_factor=rope_factor,
+        #     beta_fast=beta_fast,
+        #     beta_slow=beta_slow,
+        # )


This block for precomputing RoPE frequencies is commented out, but self.freqs_cis is used in DeepseekV3Model.__call__ at line 571. This will raise an AttributeError at runtime.

Looking at the DeepseekV3MLA implementation, the freqs_cis parameter is not used. Instead, apply_rope is called, which computes the frequencies on the fly.

To fix this, you should remove the freqs_cis parameter from the entire call chain, as it appears to be unused. This involves:

Removing freqs_cis: jax.Array from the signature of DeepseekV3MLA.__call__.

Removing freqs_cis: jax.Array from the signature of DeepseekV3DecoderLayer.__call__.

Removing the freqs_cis=self.freqs_cis argument from the layer() call within DeepseekV3Model.__call__.

This will resolve the crash and align the code with the current apply_rope implementation. You can then address the TODO about swapping the RoPE implementation in a separate change.
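To illustrate what computing the frequencies on the fly can look like, here is a minimal NumPy sketch. It is not the apply_rope from this PR: the name apply_rope_sketch, its signature, the half-split rotation layout, and the default theta are all assumptions for illustration, and it leaves out any frequency scaling implied by the commented-out rope_factor, beta_fast, and beta_slow arguments.

```python
import numpy as np


def apply_rope_sketch(x: np.ndarray, positions: np.ndarray, theta: float = 10000.0) -> np.ndarray:
    """Rotate x of shape (seq, heads, head_dim) by position-dependent angles.

    Hypothetical helper: frequencies are derived inside the call, so no
    precomputed freqs_cis table has to be threaded through the layers.
    """
    half = x.shape[-1] // 2
    # Inverse frequencies theta^(-i / half) for i = 0 .. half - 1.
    inv_freq = 1.0 / (theta ** (np.arange(half) / half))
    # Angles broadcast over the heads axis: shape (seq, 1, half).
    angles = positions[:, None, None] * inv_freq[None, None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1, x2) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2, 8))
out = apply_rope_sketch(x, np.arange(4))
```

Position 0 is left unchanged and vector norms are preserved, which is a quick sanity check for any RoPE variant. With jax.numpy in place of numpy the same body should work unchanged.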

tanmaysachan · 2026-01-18T13:59:52Z

@pcmoritz The PR is open for reviews now

In the first test case I've added a TODO: there seems to be some drift that requires an absolute tolerance of around 6e-3 for the tests to pass. I'll investigate a little more; nothing has caught my eye so far.

tanmaysachan · 2026-01-20T08:58:58Z

Fixed the source of the drift: there was a default config mismatch.

pcmoritz · 2026-01-21T01:58:28Z

This is awesome! Have you already gotten some end-to-end training working with it? It would be great to add one to https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-tx/README.md. If you haven't I'm also more than happy to help with it :)

tanmaysachan · 2026-01-21T03:20:29Z

Looks like the tests are failing, but I am unable to replicate this on my machine. Having a look.

Some qwen tests also seem to be failing - is this expected?

Have not been able to train end-to-end yet, will give it a shot over the weekend (along with any further fixes required). Added a task for it in the PR description.

tanmaysachan · 2026-01-23T03:56:09Z

Root cause of the failing tests: Hugging Face outputs are not consistent between macOS and Ubuntu (PyTorch BLAS backend is Accelerate vs MKL).

Linux

OS: Linux 6.8.0-90-generic
Machine: x86_64
Python: 3.12.12
PyTorch: 2.10.0+cu128
PyTorch BLAS: mkl
CUDA available: False
Transformers: 4.57.6

DEEPSEEK V3 TEST

HF hidden_states[-1] first 10 values (sample 0, pos 0):
[-0.05490041896700859, -0.6639361381530762, -0.4137983024120331, 0.19858041405677795, 0.4002900719642639, -1.8006019592285156, -0.7636783123016357, -0.6883448958396912, 0.39694416522979736, 2.5040738582611084]

macOS

OS: Darwin 25.2.0
Machine: arm64
PyTorch BLAS: accelerate
Transformers: 4.57.6

DEEPSEEK V3 TEST
HF hidden_states[-1] first 10 values (sample 0, pos 0):
[-0.0496, -0.6667, -0.4240, 0.1903, 0.4095, -1.8056, -0.7479, -0.6778, 0.3872, 2.5022]

Bumping thresholds
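For reference, the size of the gap between the two dumps can be checked directly. This sketch only compares the ten values quoted in this comment (the macOS values are already rounded to four decimals), so it gives a rough indication and not the tolerance the tests ended up using. The variable names are illustrative.

```python
# First 10 values of HF hidden_states[-1], copied from the two dumps in this comment.
linux_mkl = [
    -0.05490041896700859, -0.6639361381530762, -0.4137983024120331,
    0.19858041405677795, 0.4002900719642639, -1.8006019592285156,
    -0.7636783123016357, -0.6883448958396912, 0.39694416522979736,
    2.5040738582611084,
]
macos_accelerate = [
    -0.0496, -0.6667, -0.4240, 0.1903, 0.4095,
    -1.8056, -0.7479, -0.6778, 0.3872, 2.5022,
]

# Element-wise absolute difference between the two platforms.
abs_diffs = [abs(a - b) for a, b in zip(linux_mkl, macos_accelerate)]
max_abs_diff = max(abs_diffs)
print(f"max abs diff: {max_abs_diff:.4f}")  # max abs diff: 0.0158
```

On these ten values the two reference outputs differ by up to roughly 1.6e-2, so a tolerance near 6e-3 cannot hold on both platforms.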

vercel · 2026-01-25T09:00:52Z

@tanmaysachan is attempting to deploy a commit to the Tyler's projects Team on Vercel.

A member of the Team first needs to authorize it.

- Add LogitsProcessorMixin to DeepseekV3ForCausalLM
- Add get_lm_head() method for logits computation
- Fix broken compute_positions import
- Fix init_lora_adapter to handle n_routed_experts attribute
- Add test_deepseekv3_lora_training.py with MoE rank normalization tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

tanmaysachan · 2026-01-25T09:08:10Z

Accidental deployment attempt due to rebase ^

tanmaysachan · 2026-01-25T09:31:48Z

End-to-end training successful on an A100.

/api/v1/healthz -> {"status":"ok"}

Added GPU tests (they need Anyscale credentials to run in CI).

GPU tests on A100:

(skyrl-tx) (main) root@C.30490604:/workspace/SkyRL/skyrl-tx$ uv run --extra gpu python -m pytest tests/models/test_deepseekv3_lora_training.py -v
======================================================================================= test session starts ========================================================================================
platform linux -- Python 3.11.14, pytest-9.0.2, pluggy-1.6.0 -- /workspace/SkyRL/skyrl-tx/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/SkyRL/skyrl-tx
configfile: pyproject.toml
plugins: anyio-4.12.1
collected 2 items

tests/models/test_deepseekv3_lora_training.py::test_lora_training_moe_rank_normalized PASSED [ 50%]
tests/models/test_deepseekv3_lora_training.py::test_lora_training_high_rank PASSED [100%]

pcmoritz · 2026-01-27T07:20:03Z

@tanmaysachan Thanks a lot for implementing this, this is excellent work! I'd like to get something like the following working

uv run --extra gpu --extra tinker -m tx.tinker.api --base-model zai-org/GLM-4.7-Flash  --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 1, "expert_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

(I think we should be able to support zai-org/GLM-4.7-Flash since it has a DeepseekV3-like architecture.) I think this can run on 8xH100, so it would be great to add to the next release notes for people to try out :)

I can look a little more into what is needed to make this work :)
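As a small aside on the command above: the --backend-config value is a JSON string, so building it from a dict avoids shell quoting mistakes. This is only a sketch; the keys and values are copied from the command in this comment, and the variable names are illustrative.

```python
import json
import shlex

# Same settings as in the command above, expressed as a Python dict.
backend_config = {
    "max_lora_adapters": 2,
    "max_lora_rank": 1,
    "expert_parallel_size": 8,
    "train_micro_batch_size": 1,
    "shard_attention_heads": False,
}

# json.dumps emits lowercase false; shlex.quote wraps the result for the shell.
backend_config_arg = shlex.quote(json.dumps(backend_config))
command = (
    "uv run --extra gpu --extra tinker -m tx.tinker.api "
    "--base-model zai-org/GLM-4.7-Flash "
    f"--backend-config {backend_config_arg}"
)
print(command)
```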

pcmoritz · 2026-01-27T07:29:08Z

Let me first merge this PR, and then we can implement zai-org/GLM-4.7-Flash on top of it (since I think that will be a little more work)

pcmoritz · 2026-01-27T07:29:52Z

/gemini review

tanmaysachan · 2026-01-27T07:35:00Z

Sure, I can add GLM in a followup (this one is quite bloated already 😅)

pcmoritz · 2026-01-27T09:59:03Z

skyrl-tx/tx/models/deepseekv3.py

+        top_k_weights, top_k_index = self._compute_routing(router_logits)
+
+        expert_output = self.experts(hidden_states_flat, top_k_index, top_k_weights, adapter_indices_flat)
+        shared_output = self.shared_experts(


Reminder to self: It will be great if we can make MLP layers also support flattened states going forward, so this reshaping can be removed. It would also make the layer chunking #902 nicer. Shouldn't be hard to implement, mainly needs some refactoring of the LoRAMixin
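The reason this looks feasible: a position-wise MLP acts on the last axis only, so it gives the same result on (batch, seq, hidden) inputs and on flattened (batch * seq, hidden) inputs. A toy NumPy check, with a plain ReLU MLP standing in for the real LoRA-enabled layers (the function and variable names are illustrative and not taken from tx):

```python
import numpy as np


def mlp(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    """Toy position-wise MLP: matmuls and ReLU touch only the last axis."""
    return np.maximum(x @ w_in, 0.0) @ w_out


rng = np.random.default_rng(0)
batch, seq, hidden, inner = 2, 3, 4, 8
x = rng.normal(size=(batch, seq, hidden))
w_in = rng.normal(size=(hidden, inner))
w_out = rng.normal(size=(inner, hidden))

# Apply the MLP to the 3D tensor and to the flattened token matrix.
out_3d = mlp(x, w_in, w_out)
out_flat = mlp(x.reshape(-1, hidden), w_in, w_out)

# Un-flattening the second result recovers the first.
same = np.allclose(out_3d, out_flat.reshape(batch, seq, hidden))
```

The part that needs actual refactoring is the per-token adapter index, which would have to be flattened alongside the hidden states.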

pcmoritz · 2026-01-27T10:01:51Z

skyrl-tx/tx/utils/models.py

            continue
        if "experts" in path:
-            tensors[key] = np.stack([tensors[get_expert_key(path, i)].T for i in range(config.num_experts)], axis=0)
+            num_experts = getattr(config, "num_experts", None) or getattr(config, "n_routed_experts")


TODO: I think this can be handled in a unified way by ModelConfig, e.g. by exposing a get_num_experts method
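A minimal sketch of what such an accessor could look like. The free-function form, the attribute order, and the error message are assumptions; the later commit "Add unified access to number of experts" may implement it differently.

```python
from types import SimpleNamespace


def get_num_experts(config) -> int:
    """Return the routed-expert count regardless of which name the config uses.

    Hypothetical helper: checks num_experts first, then DeepseekV3's
    n_routed_experts, mirroring the getattr chain in the diff above.
    """
    for name in ("num_experts", "n_routed_experts"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    raise AttributeError("config defines neither num_experts nor n_routed_experts")


# Stand-in configs for illustration only.
qwen_like = SimpleNamespace(num_experts=8)
deepseek_like = SimpleNamespace(n_routed_experts=64)
print(get_num_experts(qwen_like), get_num_experts(deepseek_like))  # 8 64
```

Unlike the `or` chain in the diff, the explicit `is not None` check does not treat a value of 0 as missing.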

pcmoritz · 2026-01-27T10:07:05Z

I'll need a little more time tomorrow to finish reviewing the PR, in particular the attention part and the tests, but so far it looks great! I pushed some small fixes and cleanups (if you end up working on it some more, don't forget to pull first). Hopefully we can get it merged tomorrow!

Initialize the structure

4d31abf

gemini-code-assist bot reviewed Jan 16, 2026

tanmaysachan added 2 commits January 17, 2026 01:19

simplify MLP

6fbf660

adjust for huggingface naming conventions

8a6feac

pcmoritz added the tx label Jan 17, 2026

tanmaysachan added 2 commits January 18, 2026 19:21

Test for parity, add unit tests

0c1792e

Add TODO for tolerance

5555ec5

tanmaysachan added 3 commits January 18, 2026 19:34

Remove stray prints

3cfa781

update cache position off

6354df8

Fix drift

4983b0f

pcmoritz added 2 commits January 20, 2026 18:01

fix ruff

9d1720a

fix black

7abc0d6

Change masked fill to 0.0

94ea901

tanmaysachan added 5 commits January 23, 2026 10:29

Bump thresholds for BLAS differences

1993bb8

Retrigger CI

7acbdca

Merge branch 'main' into tanmay/deepseek_impl

1ec4c30

Merge with main, remove logits

6edbabf

more threshold tuning

23e3fd2

tanmaysachan force-pushed the tanmay/deepseek_impl branch from 23cfe46 to 7208b55 January 25, 2026 09:06

Bump for CI

16b1ea1

tanmaysachan added 2 commits January 25, 2026 14:46

Add deepseek config to tinker backend

836babe

Add deepseek to readme

6ae3c4d

Retrigger CI

fb01eca

fix ci

5084e70

pcmoritz added 7 commits January 27, 2026 00:00

Merge branch 'main' into tanmay/deepseek_impl

b1ea17b

fix warnings

73573f0

update

d042b29

simplify

864e678

simplify

6694c92

cleanup

0857151

update

f080a28

pcmoritz reviewed Jan 27, 2026

Add unified access to number of experts

daa7f0e
