
fix: add mapping of deepseek_v32 model type #42767

Open
mpashkovskii wants to merge 3 commits into huggingface:main from mpashkovskii:fix/add-deepseek_v32

Conversation

mpashkovskii commented Dec 10, 2025

What does this PR do?

Adds the missing mapping for model type deepseek_v32 to deepseek_v3 model and DeepseekV3Config

Fixes #42590
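For context, AutoConfig dispatches on a checkpoint's model_type via the auto-mapping tables, so the effect of a change like this can be sanity-checked with something like the following (a minimal sketch; it assumes the PR simply registers deepseek_v32 against the existing DeepseekV3 classes):

from transformers.models.auto.configuration_auto import CONFIG_MAPPING

# With this PR applied, the new model type should resolve to the v3 config class.
config_cls = CONFIG_MAPPING["deepseek_v32"]
print(config_cls.__name__)  # expected: DeepseekV3Config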

Before submitting

  • (almost) This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Cyrilvallez could you please review the changes?

mpashkovskii (Author) commented:

I think the tests are failing because of an unrelated ResNet precision error.

huzama commented Dec 11, 2025

I noticed that deepseek-ai/DeepSeek-V3.2 uses DeepSeek's native sparse attention. Does the current deepseek_v3 architecture support this? I don't see the Indexer or selector in the code here, so I wonder if this mapping is safe:

if self.q_lora_rank is None:
    q_states = self.q_proj(hidden_states)
else:
    q_states = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
q_states = q_states.view(query_shape).transpose(1, 2)
q_pass, q_rot = torch.split(q_states, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)

compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
k_pass, k_rot = torch.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pass = self.kv_b_proj(self.kv_a_layernorm(k_pass)).view(key_shape).transpose(1, 2)
k_pass, value_states = torch.split(k_pass, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)

k_rot = k_rot.view(batch_size, 1, seq_length, self.qk_rope_head_dim)

cos, sin = position_embeddings
if self.config.rope_interleave:  # support using interleaved weights for efficiency
    q_rot, k_rot = apply_rotary_pos_emb_interleave(q_rot, k_rot, cos, sin)
else:
    q_rot, k_rot = apply_rotary_pos_emb(q_rot, k_rot, cos, sin)
k_rot = k_rot.expand(*k_pass.shape[:-1], -1)

query_states = torch.cat((q_pass, q_rot), dim=-1)
key_states = torch.cat((k_pass, k_rot), dim=-1)

if past_key_values is not None:
    # sin and cos are specific to RoPE models; cache_position needed for the static cache
    cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
    key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)

if self.config._attn_implementation == "flash_attention_2" and self.qk_head_dim != self.v_head_dim:
    value_states = F.pad(value_states, [0, self.qk_head_dim - self.v_head_dim])
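For reference, DeepSeek-V3.2's sparse attention (DSA) first scores past tokens with a lightweight indexer and then restricts attention to the top-k keys per query; none of that appears in the v3 code above. A minimal illustrative sketch of the selection step (hypothetical names, not DeepSeek's reference implementation):

import torch

def topk_attention_mask(index_scores: torch.Tensor, k: int) -> torch.Tensor:
    # index_scores: (batch, seq_q, seq_k) relevance scores produced by an indexer.
    # Returns a boolean mask that keeps only the top-k past tokens per query.
    k = min(k, index_scores.size(-1))
    topk_idx = index_scores.topk(k, dim=-1).indices          # (batch, seq_q, k)
    mask = torch.zeros_like(index_scores, dtype=torch.bool)
    mask.scatter_(-1, topk_idx, True)                        # True = token is attended
    return mask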

Rocketknight1 (Member) commented:

Yes, I don't think we can just map the new model to the old architecture!

github-actions (Contributor) commented:

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v32

mpashkovskii (Author) commented:

Hi @huzama and @Rocketknight1, thanks for pointing that out. I’ve added the initial DeepSeek v3.2 implementation, but it still needs more testing and validation. I’d appreciate any feedback you have.

Do you know if anyone else is actively working on this? If so, does it make sense to complete the implementation in this PR?

github-actions (Contributor) commented:

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42767&sha=f17882

huzama commented Dec 15, 2025

@mpashkovskii, I’m working on implementing an indexer and top-k feature for a personal project, but some minor changes are needed before it can become a pull request.

You can try writing the code for the DSA Indexer yourself. Alternatively, once I have a well-drafted version, I can push the changes.

freedom-cui commented:

Hello @mpashkovskii @huzama, does this PR already support DeepSeek-V3.2 in its current state?

freedom-cui commented:

Hello @mpashkovskii, when I used your PR, I noticed that the loaded model uses the DeepSeek-V3 model structure instead of DeepSeek-V3.2:

import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "DeepSeek-v3.2"
dtype = torch.bfloat16  # dtype was not defined in the original snippet; bf16 assumed

config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_config(
    config=config, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=dtype
)
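One way to see which architecture was actually instantiated (illustrative; the printed class name assumes the mapping reuses the v3 modeling code):

print(type(model).__name__)  # prints DeepseekV3ForCausalLM under the current mapping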

huzama commented Dec 23, 2025

@freedom-cui The model is not implemented yet as of the last commit. If you only need inference, please check out the vLLM library!

freedom-cui commented:

> @freedom-cui The model is not implemented yet as of the last commit. If you only need inference, please check out the vLLM library!

Thank you very much for your reply. Is there a timeline for supporting DeepSeek-V3.2 at this time?

vasqu (Contributor) commented Jan 12, 2026

Please see #41251 (comment) cc @ArthurZucker
