[tx] DeepseekV3 implementation #889
base: main
Code Review
This pull request introduces the JAX implementation for the DeepseekV3 model. The implementation is comprehensive and covers the model's unique features like Multi-Head Latent Attention and Mixture of Experts with shared experts. The code is well-structured.
My review focuses on a critical bug that will prevent the model from running, along with some suggestions to improve maintainability by reducing code duplication and avoiding magic numbers. Addressing these points will make the implementation more robust and easier to maintain.
skyrl-tx/tx/models/deepseekv3.py
Outdated
# Precompute RoPE frequencies
# qk_rope_head_dim = config.qk_rope_head_dim
# original_seq_len = getattr(config, "original_seq_len", config.max_position_embeddings)
# rope_factor = getattr(config, "rope_factor", 1.0)
# beta_fast = getattr(config, "beta_fast", 32)
# beta_slow = getattr(config, "beta_slow", 1)

# TODO: Swap out like llama's rope?
# self.freqs_cis = precompute_freqs_cis(
#     dim=qk_rope_head_dim,
#     max_seq_len=config.max_position_embeddings,
#     rope_theta=config.rope_theta,
#     original_seq_len=original_seq_len,
#     rope_factor=rope_factor,
#     beta_fast=beta_fast,
#     beta_slow=beta_slow,
# )
This block for precomputing RoPE frequencies is commented out, but self.freqs_cis is used in DeepseekV3Model.__call__ at line 571. This will raise an AttributeError at runtime.
Looking at the DeepseekV3MLA implementation, the freqs_cis parameter is not used. Instead, apply_rope is called, which computes the frequencies on the fly.
To fix this, you should remove the freqs_cis parameter from the entire call chain, as it appears to be unused. This involves:
- Removing freqs_cis: jax.Array from the signature of DeepseekV3MLA.__call__.
- Removing freqs_cis: jax.Array from the signature of DeepseekV3DecoderLayer.__call__.
- Removing the freqs_cis=self.freqs_cis argument from the layer() call within DeepseekV3Model.__call__.
This will resolve the crash and align the code with the current apply_rope implementation. You can then address the TODO about swapping the RoPE implementation in a separate change.
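The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `ToyLayer`, `ToyModel`, `ToyLayerFixed`, `ToyModelFixed`, and `raises_attribute_error` are hypothetical stand-ins, and the layers simply return their input. It shows that an attribute whose assignment is commented out in `__init__` raises `AttributeError` on first use, and that dropping the unused parameter from the call chain avoids the error.

```python
class ToyLayer:
    """Hypothetical layer that accepts freqs_cis but never reads it."""

    def __call__(self, x, freqs_cis=None):
        # In the real model, RoPE would be applied on the fly here.
        return x


class ToyModel:
    """Hypothetical model mirroring the bug: the assignment is commented out."""

    def __init__(self):
        # self.freqs_cis = precompute_freqs_cis(...)  # never executed
        self.layer = ToyLayer()

    def __call__(self, x):
        # Reading self.freqs_cis fails because __init__ never set it.
        return self.layer(x, freqs_cis=self.freqs_cis)


class ToyLayerFixed:
    """Same layer with the unused parameter removed from the signature."""

    def __call__(self, x):
        return x


class ToyModelFixed:
    """Same model with the freqs_cis argument removed from the layer call."""

    def __init__(self):
        self.layer = ToyLayerFixed()

    def __call__(self, x):
        return self.layer(x)


def raises_attribute_error(model, x):
    """Return True if calling the model raises AttributeError."""
    try:
        model(x)
    except AttributeError:
        return True
    return False
```

Calling `raises_attribute_error(ToyModel(), 1.0)` exercises the broken path, while `ToyModelFixed()(1.0)` runs through without touching the missing attribute.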
@pcmoritz The PR is open for review now. In the first test case I've added a TODO: there seems to be some kind of drift that requires an absolute tolerance of around 6e-3 for the tests to pass. I'll investigate a little more; nothing has caught my eye so far.
Fixed the source of the drift: there was a default config mismatch.
This is awesome! Have you already gotten some end-to-end training working with it? It would be great to add one to https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-tx/README.md. If you haven't, I'm also more than happy to help with it :)
Looks like the tests are failing, but I can't replicate this on my machine. Some qwen tests also seem to be failing. Is this expected? I haven't been able to train end-to-end yet; I'll give it a shot over the weekend (along with any further fixes required). Added a task for it in the PR description.
Failing tests root cause: Hugging Face outputs are not consistent between macOS and Ubuntu (Accelerate vs MKL).
Linux
OS: Linux 6.8.0-90-generic
Machine: x86_64
Python: 3.12.12
PyTorch: 2.10.0+cu128
PyTorch BLAS: mkl
CUDA available: False
Transformers: 4.57.6
DEEPSEEK V3 TEST
HF hidden_states[-1] first 10 values (sample 0, pos 0):
macOS
OS: Darwin 25.2.0
Machine: arm64
PyTorch BLAS: accelerate
Transformers: 4.57.6
DEEPSEEK V3 TEST
Bumping thresholds.
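To illustrate why outputs can differ across numerical backends, here is a small sketch using only the Python standard library. It rests on an assumption the thread does not confirm: that the two BLAS libraries accumulate the same terms in a different order. The helper `close_enough` and the tolerance `ATOL` are hypothetical and are not the thresholds used in the PR's tests.

```python
import math

# Floating-point addition is not associative, so summing the same
# terms in a different order can give slightly different results.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Hypothetical absolute tolerance, chosen only for this illustration.
ATOL = 1e-6


def close_enough(a, b, atol=ATOL):
    """Compare two floats using an absolute tolerance only."""
    return math.isclose(a, b, rel_tol=0.0, abs_tol=atol)


# The two sums are not bitwise equal, but they agree within ATOL.
sums_differ = left_to_right != right_to_left
sums_close = close_enough(left_to_right, right_to_left)
```

A tolerance-based comparison like this is the usual way to make a cross-platform numerical test robust; the trade-off is that a threshold that is too loose can hide real regressions.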
@tanmaysachan is attempting to deploy a commit to the Tyler's projects Team on Vercel. A member of the Team first needs to authorize it. |
- Add LogitsProcessorMixin to DeepseekV3ForCausalLM
- Add get_lm_head() method for logits computation
- Fix broken compute_positions import
- Fix init_lora_adapter to handle n_routed_experts attribute
- Add test_deepseekv3_lora_training.py with MoE rank normalization tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
23cfe46 to 7208b55
Accidental deployment attempt due to rebase ^
End-to-end training successful on an A100. /api/v1/healthz -> {"status":"ok"}
Added GPU tests (need Anyscale creds to run).
GPU tests on A100:
@tanmaysachan Thanks a lot for implementing this, this is excellent work! I'd like to get something like the following working (I think we should be able to support I can look a little more into what is needed to make this work :)
Let me first merge this PR, and then we can implement
/gemini review
Sure, I can add GLM in a followup (this one is quite bloated already 😅)
top_k_weights, top_k_index = self._compute_routing(router_logits)

expert_output = self.experts(hidden_states_flat, top_k_index, top_k_weights, adapter_indices_flat)
shared_output = self.shared_experts(
Reminder to self: It will be great if we can make MLP layers also support flattened states going forward, so this reshaping can be removed. It would also make the layer chunking #902 nicer. Shouldn't be hard to implement, mainly needs some refactoring of the LoRAMixin
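The reshaping mentioned here can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `apply_per_token` is a hypothetical helper, NumPy stands in for JAX, and the per-token function is a trivial placeholder for an expert or MLP layer.

```python
import numpy as np


def apply_per_token(fn, hidden_states):
    """Flatten (batch, seq, hidden) to (batch * seq, hidden), apply fn, restore the shape."""
    batch, seq, hidden = hidden_states.shape
    flat = hidden_states.reshape(batch * seq, hidden)
    out_flat = fn(flat)
    # Restore the leading (batch, seq) dimensions; the last one is inferred.
    return out_flat.reshape(batch, seq, -1)


# Placeholder input and per-token function, for illustration only.
x = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
y = apply_per_token(lambda t: t * 2.0, x)
```

If every layer accepted the flattened `(batch * seq, hidden)` layout directly, the two `reshape` calls around each layer would become unnecessary, which is the refactoring the comment suggests.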
skyrl-tx/tx/utils/models.py
Outdated
continue
if "experts" in path:
    tensors[key] = np.stack([tensors[get_expert_key(path, i)].T for i in range(config.num_experts)], axis=0)
    num_experts = getattr(config, "num_experts", None) or getattr(config, "n_routed_experts")
TODO: I think this can be handled in a unified way by ModelConfig, e.g. by exposing a get_num_experts method
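One possible shape for such an accessor is sketched below. It is written as a free function over a generic config object because the actual ModelConfig class is not shown in this thread; the function body, the error message, and the `SimpleNamespace` test configs are assumptions for illustration. Only the two attribute names, `num_experts` and `n_routed_experts`, come from the diff above.

```python
from types import SimpleNamespace


def get_num_experts(config):
    """Return the routed-expert count under either attribute name."""
    for name in ("num_experts", "n_routed_experts"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    raise AttributeError("config defines neither num_experts nor n_routed_experts")


# Hypothetical configs standing in for two model families.
config_a = SimpleNamespace(num_experts=8)
config_b = SimpleNamespace(n_routed_experts=256)
count_a = get_num_experts(config_a)
count_b = get_num_experts(config_b)
```

Unlike the inline `getattr(...) or getattr(...)` expression, this version falls through only when the first attribute is missing or `None`, so a value of `0` would be returned as-is rather than skipped.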
Addressed
I'll need a little more time tomorrow to finish reviewing the PR, in particular the attention part and the tests, but so far it looks great! I pushed some small fixes and cleanups (if you end up working on it some more, don't forget to pull first). Hopefully we can get it merged tomorrow!
Addresses #865