
Add DeepSeek V4 #45616

Closed

ArthurZucker wants to merge 14 commits into huggingface:main from ArthurZucker:add-deepseek-v4

Conversation

@ArthurZucker (Collaborator) commented Apr 24, 2026

Draft moved to #45643

Initial modular implementation covering DeepSeek-V4-Flash/Pro and their
-Base siblings (all share the same architecture). New pieces vs V3.2:

* Sliding-window attention with a per-layer KV Compressor (learned gated
  pooling) and an Indexer selecting top-k compressed positions for
  long-range attention. No MLA.
* Hyper-Connections replace the residual stream (always on).
* Mixtral-style top-k MoE routing, no expert groups. First num_hash_layers
  layers route via a frozen tid2eid lookup keyed by input token ids.
* Per-head learnable attention sink; grouped low-rank output projection.

MTP weights in the checkpoint are ignored on load (added elsewhere).
Eager-only attention for now — SDPA/flash backends do not yet support
the sink term.
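
For readers unfamiliar with the sink mechanism, here is a minimal eager-attention sketch in the GPT-OSS style this PR ports. The function name and tensor layout are illustrative, not the PR's actual code:

```python
import torch
import torch.nn.functional as F

def eager_attention_with_sink(query, key, value, sinks, scaling, attention_mask=None):
    # query/key/value: [B, H, S, D]; sinks: learnable per-head logits [H].
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    # Append one extra "sink" column per head: softmax mass can drain into
    # a slot that no value vector backs.
    sink_col = sinks.view(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[2], 1)
    probs = F.softmax(torch.cat([attn_weights, sink_col], dim=-1), dim=-1, dtype=torch.float32)
    # Drop the sink column before mixing values; that discarded mass is the
    # part SDPA/flash kernels cannot currently express.
    return torch.matmul(probs[..., :-1].to(value.dtype), value)
```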
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* RoPE: drop custom embedding; use LlamaRotaryEmbedding. qk_rope_head_dim
  is honoured via rope_parameters['partial_rotary_factor'] which routes
  through the shared partial-aware init path. Main vs compressed rope
  bases built via a small helper (_build_rotary) at the Model level.
* RoPE apply: use apply_rotary_pos_emb_interleave from V3 for q/k rope
  slice (V4 reference uses interleaved-pair rotation via complex mul).
* Attention sink: port eager_attention_forward from GPT-OSS verbatim
  (renamed 'attn_sink' -> 'sinks' to match checkpoint/HF naming).
* SwiGLU clamp: match GPT-OSS clamp semantics on routed experts; shared
  expert is unclipped. Inlined into a forward override on DeepseekV4Experts
  to stay compatible with @use_experts_implementation.
* Compressor/Indexer statefulness: both are stateless now. State lives on
  a new DeepseekV4Cache(DynamicCache) — per-layer compressor_state and
  indexer_state dicts (buffer_kv, buffer_gate, pooled_kv). Window K/V
  continues through DynamicCache's DynamicSlidingWindowLayer.
* Remove _project_q / _project_kv helpers; fold into forward.
* Remove _score_fn; use ACT2FN via a tiny _resolve_activation wrapper
  that also understands 'sqrtsoftplus' (not in the global registry).
* HyperConnection: single module with a forward that wraps an inner
  callable and does pre-reduce -> inner -> post-expand. attn_hc and
  mlp_hc are now invoked through __call__.
* MLP: packed gate_up_proj; shared expert uses it too.
* Hash + top-k routers: unconditional norm_topk_prob normalisation
  (V4 ships with norm_topk_prob=True; dropped the conditional). A
  sketch of the renorm follows these notes.
* hc_head + final RMSNorm live on DeepseekV4Model, not ForCausalLM.
  Matches the standard transformers contract: Model returns
  [B, S, hidden], ForCausalLM only owns lm_head.

Tests (4) pass. ruff + check_config_attributes clean.
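
A minimal sketch of the unconditional top-k renormalisation mentioned in the router bullet above (illustrative only; the real routers also apply the scoring function and, for the top-k router, the correction bias):

```python
import torch

def topk_route(router_logits: torch.Tensor, top_k: int):
    # Unconditional norm_topk_prob: softmax over experts, keep top-k,
    # renormalise so the kept weights sum to 1 for every token.
    scores = router_logits.softmax(dim=-1, dtype=torch.float32)
    topk_weights, topk_idx = scores.topk(top_k, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_idx
```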
* Config inherits DeepseekV3Config; V3 MLA/group fields set to None
  and allow-listed in check_config_attributes; skip V3 __post_init__
  so V4's head_dim=512 is preserved.
* RMSNorm + RotaryEmbedding inherit V3 classes directly (no rebuild).
  Main + compress rotary built inline in Model by swapping
  rope_parameters on a copy.copy(config).
* Drop _SqrtSoftplus / _resolve_activation. Routers use ACT2FN where
  possible; sqrtsoftplus fallback is an inline F.softplus(x).sqrt().
* Drop _build_rotary helper.
* Cache: DeepseekV4SlidingLayer stores K=V once (no double-update);
  DeepseekV4Cache installs those layers + compressor/indexer state.
* DeepseekV4GroupedLinear is an nn.Linear subclass for the grouped
  low-rank output projection — quantizers keyed on .weight still see
  a valid (out, in) shape; forward does per-group bmm (sketched after
  this list).
* Remove module-level DeepseekV4Experts.forward monkey-patch; proper
  @use_experts_implementation class with clamp inline in forward.
* Shared expert inherits Qwen2MoeMLP (packed gate/up not used there —
  V3/Qwen2MoE convention).
* DeepseekV4TopKRouter / HashRouter: standalone, same weight+bias
  layout as V3, V4 scoring/renorm inline. Hash router's forward
  computes logits inline (no super() chain into V3 to survive
  modular conversion).
* HyperConnection: single module, forward(hidden_states, inner,
  layernorm, **kwargs); decoder layer calls attn_hc(...) and
  mlp_hc(...) directly — no _attn_inner / _mlp_inner callbacks.
* hc_head + final RMSNorm live on DeepseekV4Model.
* DeepseekV4ForCausalLM only defines __init__; forward is inherited
  from MixtralForCausalLM unchanged.
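
A hypothetical sketch of the grouped-linear trick from the list above, assuming a block-diagonal grouping (the actual V4 projection layout may differ):

```python
import torch
from torch import nn

class GroupedLinearSketch(nn.Linear):
    """Hypothetical stand-in for DeepseekV4GroupedLinear. Subclassing
    nn.Linear keeps a dense (out, in) .weight so quantizers keyed on
    that shape stay happy, while forward only touches the g diagonal
    blocks via a batched matmul."""

    def __init__(self, in_features, out_features, num_groups):
        super().__init__(in_features, out_features, bias=False)
        self.num_groups = num_groups

    def forward(self, x):
        g = self.num_groups
        out_g, in_g = self.out_features // g, self.in_features // g
        idx = torch.arange(g, device=x.device)
        # Pull the g diagonal (out_g, in_g) blocks out of the dense weight.
        w = self.weight.view(g, out_g, g, in_g)[idx, :, idx, :]
        xg = x.reshape(-1, g, in_g).transpose(0, 1)     # [g, N, in_g]
        yg = torch.bmm(xg, w.transpose(1, 2))           # [g, N, out_g]
        return yg.transpose(0, 1).reshape(*x.shape[:-1], self.out_features)
```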
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 24, 2026
The container image lacks native deepseek_v4 model type registration.
Install from huggingface/transformers#45616 (ArthurZucker/add-deepseek-v4)
to resolve the KeyError at config loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HyperConnection restructure:
* DeepseekV4HyperConnection now owns 'inner' (attn/mlp) and 'norm'
  (the per-site RMSNorm). Decoder-layer forward collapses to
      hidden_states = self.attn_hc(hidden_states, **kwargs)
      return self.mlp_hc(hidden_states, **kwargs)
  No more passing submodules as call arguments.
* DeepseekV4SparseMoeBlock.forward accepts **_ so the shared kwargs
  flow works for both attn and mlp sites.
* Hash router falls back to top-k over the learned gate weight when
  input_ids isn't threaded (inputs_embeds inference path).
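
A toy sketch of that fallback behaviour (class and buffer names are illustrative, and the uniform weights in the hash path are an assumption):

```python
import torch
from torch import nn

class HashRouterSketch(nn.Module):
    # Frozen token-id -> expert-id routing, with a learned-gate fallback
    # for the inputs_embeds path where token ids never arrive.
    def __init__(self, vocab_size, num_experts, hidden_size, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.register_buffer("tid2eid", torch.randint(num_experts, (vocab_size, top_k)))

    def forward(self, hidden_states, input_ids=None):
        if input_ids is not None:
            # Routing is a pure function of the token id.
            expert_idx = self.tid2eid[input_ids.reshape(-1)]
            weights = hidden_states.new_full(expert_idx.shape, 1.0 / self.top_k)
        else:
            # Fallback: score against the gate and renormalise the top-k.
            scores = self.gate(hidden_states.reshape(-1, hidden_states.shape[-1])).softmax(-1)
            weights, expert_idx = scores.topk(self.top_k, dim=-1)
            weights = weights / weights.sum(-1, keepdim=True)
        return weights, expert_idx
```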

Conversion mapping:
* New 'deepseek_v4' entry in src/transformers/conversion_mapping.py
  with four WeightRenaming rules mapping the standard decoder-layer
  names (self_attn, input_layernorm, mlp, post_attention_layernorm)
  onto the new HC-owned module tree ({attn,mlp}_hc.{inner,norm}).
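
The mapping itself is easy to state as plain string rewriting. This is a hypothetical standalone illustration, not the real WeightRenaming API:

```python
import re

# The four renames described above, as bare regex rules.
RENAMES = [
    (r"\.self_attn\.", ".attn_hc.inner."),
    (r"\.input_layernorm\.", ".attn_hc.norm."),
    (r"\.mlp\.", ".mlp_hc.inner."),
    (r"\.post_attention_layernorm\.", ".mlp_hc.norm."),
]

def rename_key(key: str) -> str:
    for pattern, replacement in RENAMES:
        key = re.sub(pattern, replacement, key)
    return key

assert (rename_key("model.layers.0.self_attn.q_proj.weight")
        == "model.layers.0.attn_hc.inner.q_proj.weight")
```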

Config + RoPE:
* rope_scaling removed (it's a property alias of rope_parameters on
  PreTrainedConfig; declaring it as a field made both mutate each
  other and broke to_dict roundtrips).
* partial_rotary_factor is a config field and is set to
  qk_rope_head_dim / head_dim when absent; this is the HF-standard
  mechanism for sizing cos/sin to the rope-only portion of each head
  (see the sketch after this list).
* DeepseekV4RotaryEmbedding overrides compute_default_rope_parameters
  to honour partial_rotary_factor on the default rope path as well.
* compress_rope_parameters derived from rope_parameters at __post_init__
  with rope_theta swapped.
* compress_ratios: accept either num_hidden_layers or +MTP length,
  truncate to num_hidden_layers.
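
A small worked example of the sizing mechanism (head_dim=512 comes from this PR; qk_rope_head_dim=64 is assumed for illustration):

```python
import torch

head_dim, qk_rope_head_dim, rope_theta = 512, 64, 10000.0
partial_rotary_factor = qk_rope_head_dim / head_dim   # 0.125

# The shared init path sizes inv_freq (and hence cos/sin) to the
# rope-only slice of each head rather than the full head_dim.
rotary_dim = int(head_dim * partial_rotary_factor)    # 64 == qk_rope_head_dim
inv_freq = 1.0 / rope_theta ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim)
assert inv_freq.numel() == qk_rope_head_dim // 2      # cos/sin built from this
```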

Stateless Compressor/Indexer:
* Both read/write state on the cache via getattr(..., state_key, None)
  so plain DynamicCache instances (generation default) work without
  crashing; stateful optimisation only kicks in with DeepseekV4Cache.
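
A minimal sketch of that defensive pattern, with hypothetical names:

```python
def pooled_kv_for_layer(cache, layer_idx, compute_fn):
    # Recompute from scratch on a plain DynamicCache; reuse and extend
    # the per-layer state when a DeepseekV4Cache is installed.
    state = getattr(cache, "compressor_state", None)
    if state is None:                  # plain DynamicCache: stateless path
        return compute_fn(prev=None)
    out = compute_fn(prev=state.get(layer_idx))
    state[layer_idx] = out             # stateful optimisation kicks in
    return out
```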

Tests:
* test_modeling_deepseek_v4.py now inherits CausalLMModelTest +
  CausalLMModelTester. Model-specific config attrs are declared on
  the tester class; get_config() threads them into DeepseekV4Config.
* Pipeline tests skipped (V4 has no ForSequenceClassification /
  ForTokenClassification / ForQuestionAnswering heads).
* 78 of 127 non-skipped tests pass; remaining failures are specific
  edge cases (gradient-checkpointing, torch.compile, rope scaling
  variants) to chase in follow-ups.
…/Experts

HyperConnection:
* Drop DeepseekV4HyperConnection module. HC is now three free helper
  functions (_hyper_connection_weights, _hyper_connection_collapse,
  _hyper_connection_expand) plus layer-level parameters on
  DeepseekV4DecoderLayer (hc_attn_*, hc_ffn_*), matching the upstream
  reference naming and keeping the decoder-layer forward readable:

      collapse → norm → self_attn → expand     (attention site)
      collapse → norm → mlp       → expand     (mlp site)

  Module tree matches the checkpoint's standard self_attn / mlp /
  input_layernorm / post_attention_layernorm; conversion_mapping entry
  dropped.
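
A simplified, static-weight sketch of one HC site under these helper names (the real helpers derive the mixing weights dynamically from the streams; the expand form follows the post·out + comb·streams shape noted in a later revision):

```python
import torch

def hc_site(streams, norm, inner, pre, post, comb, **kwargs):
    # streams: [B, n, S, D]; pre: [n]; post: [n]; comb: [n, n]
    x = torch.einsum("n,bnsd->bsd", pre, streams)        # collapse
    out = inner(norm(x), **kwargs)                       # norm -> attn/mlp
    return (torch.einsum("n,bsd->bnsd", post, out)       # expand
            + torch.einsum("nm,bmsd->bnsd", comb, streams))

# A decoder layer then runs two such sites back to back:
#   streams = hc_site(streams, input_layernorm, self_attn, ...)
#   streams = hc_site(streams, post_attention_layernorm, mlp, ...)
```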

Compressor / Indexer:
* Compressor MAY own an Indexer (only when compress_ratio == 4); the
  Indexer no longer owns a nested Compressor — it runs its own pooling
  inline at index_head_dim.
* Compressor.forward returns the final long-range KV segment for
  the layer (indexer-filtered if applicable); attention just does
  torch.cat without gather / topk logic of its own.

Experts:
* DeepseekV4Experts inherits GptOssExperts (packed gate_up_proj,
  per-expert loop, _apply_gate hook). V4's _apply_gate: chunk(2),
  clamp gate/up by swiglu_limit, SiLU * up. No biases.
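
A sketch of that _apply_gate hook (the one-sided gate clamp mirrors GPT-OSS semantics; treat the exact bounds as an assumption):

```python
import torch
import torch.nn.functional as F

def apply_gate(gate_up: torch.Tensor, limit: float) -> torch.Tensor:
    # Split the packed gate_up_proj output, clamp both halves by
    # swiglu_limit, then SiLU(gate) * up. No biases.
    gate, up = gate_up.chunk(2, dim=-1)
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up
```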

Routers:
* DeepseekV4TopKRouter inherits MixtralTopKRouter; adds the V4
  scoring_func (via ACT2FN) and the learnable noaux_tc correction bias
  buffer (not a Parameter — biases argmax only, no gradient path).
* DeepseekV4HashRouter inherits DeepseekV4TopKRouter, drops bias, adds
  the tid2eid lookup buffer. Raises cleanly when input_ids is missing
  (inputs_embeds path is unsupported for num_hash_layers > 0).
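
A minimal sketch of the correction-bias trick described above:

```python
import torch

def route_with_correction_bias(scores, correction_bias, top_k):
    # The bias shifts only which experts win the top-k (argmax path,
    # no gradient); the returned weights come from the unbiased scores.
    _, expert_idx = (scores + correction_bias).topk(top_k, dim=-1)
    weights = scores.gather(-1, expert_idx)
    return weights / weights.sum(-1, keepdim=True), expert_idx
```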

Activations:
* Add SqrtSoftplusActivation to the global ACT2FN registry so router
  scoring is a one-line ACT2FN[name] lookup with no local fallback.
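
The activation itself is a two-liner:

```python
import torch
import torch.nn.functional as F

class SqrtSoftplusActivation(torch.nn.Module):
    # sqrt(softplus(x)): softplus keeps the argument strictly positive,
    # so the square root is always well defined.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softplus(x).sqrt()
```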

Tests:
* Switch to CausalLMModelTester defaults via __init__ kwargs; force
  num_hidden_layers=2, compress_ratios=[0, 4], num_hash_layers=0 so the
  inputs_embeds generation tests in CausalLMModelTest run. Extra
  V4-specific tests (hash routing, compressor, attention sink) carried
  in separate methods.
* Override _check_past_key_values_for_generate to accept the
  sliding-window-truncated K/V shapes (every V4 layer is SWA).
* Override _check_attentions_for_generate / _check_hidden_states_for_generate
  to accept per-layer compressor KV expansion and the hc_mult stream axis.
* test_all_params_have_gradient = False — indexer params go through
  a non-differentiable argmax; the upstream recipe trains them through
  a separate objective.

Status: 109/122 tests pass, 13 known failures (TP on MPS [env], a few
numerical-match and compile-related generation tests).
* Compressor and Indexer are now pure math. All state accounting
  (per-layer pre-pool buffers, running pooled cache) is managed via
  two free helpers, _accumulate_windows and _update_pool, that live on
  the cache instance (DeepseekV4Cache or, defensively, any DynamicCache).
  Single cache update per call, mirroring past_key_values.update(k, v)
  semantics (see the sketch after this list).
* Compressor.forward always returns a tensor (empty shape [B, 1, 0, D]
  when no window has closed yet) — no more None code paths.
* Indexer no longer owns a nested Compressor; it pools inline through
  the same helpers with a distinct state_key. Only the Compressor owns
  the Indexer, never the other way around.
* Cache-type polymorphism: the helpers work on plain DynamicCache too
  (generation installs one by default), so V4 works with any
  Cache subclass without requiring our custom class.
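
A sketch of the window-buffer accounting these helpers implement (names and tensor layout illustrative):

```python
import torch

def accumulate_windows(buffer, new_tokens, window):
    # Stash incoming tokens on the cache; emit only fully closed windows
    # for pooling and keep the open remainder buffered.
    buffer = new_tokens if buffer is None else torch.cat([buffer, new_tokens], dim=-2)
    closed = (buffer.shape[-2] // window) * window
    return buffer[..., :closed, :], buffer[..., closed:, :]  # (to pool, to keep)
```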

* Inline the HC 'collapse' step in the decoder layer — it's a one-liner.
  Keep _hyper_connection_weights (shared mix-logit machinery) and
  _hyper_connection_expand (post·out + comb·streams) as helpers.
* Add ASCII diagrams to _hyper_connection_weights and
  DeepseekV4DecoderLayer explaining the HC pipeline vs the classic
  residual decoder layer.
* Add a block comment in DeepseekV4Attention.forward explaining *why*
  the output's rope slice is un-rotated (V shares with K in V4, so
  attention outputs carry position-entangled content on the rope dims;
  conjugate rotation at the query position pulls it back into a
  position-independent frame before the output projection).

HC parameters are cast to fp32 at use time for Sinkhorn stability.

Tests: 109/125 pass, 16 known failures (TP on MPS [env], torch.compile
paths, a numerical-match test sensitive to the attention sink under
padding).
RoPE:
* Use Llama's standard apply_rotary_pos_emb (rotate_half + cat(freqs, freqs))
  instead of V3's apply_rotary_pos_emb_interleave, which did a rearrange-
  then-rotate round trip.
* DeepseekV4RotaryEmbedding inherits DeepseekV3RotaryEmbedding and only
  overrides compute_default_rope_parameters to honour partial_rotary_factor
  so cos/sin comes out sized to qk_rope_head_dim.
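
For reference, the Llama-style apply being switched to (abridged from transformers):

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # cos/sin already arrive as cat(freqs, freqs) along the last dim,
    # so one fused multiply covers both halves of each head.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```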

Hyper-Connections:
* DeepseekV4HyperConnection is now a proper nn.Module owning (fn, base,
  scale). Each decoder layer has two instances (attn_hc, ffn_hc) and calls
  .compute_weights(hidden_streams) -> (pre, post, comb) on each site.
* The stream collapse and expand math is inlined in the decoder layer —
  two lines each — with matching ASCII diagrams on the class docstring.
* Checkpoint keys (hc_attn_{fn,base,scale}, hc_ffn_{fn,base,scale}) are
  bridged to attn_hc.* / ffn_hc.* via conversion_mapping.py.

Cache:
* accumulate_windows / update_pool are methods on DeepseekV4Cache.
* DeepseekV4Cache.adopt coerces incoming caches: DynamicCache (generation
  default) gets its class reinterpreted in place; StaticCache and friends
  get the methods bolted on. The state store is created lazily.
* Ephemeral adopt at the Attention boundary handles the grad-checkpoint
  pass where past_key_values is stripped.
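
A simplified sketch of adopt (the real method also creates the state store lazily):

```python
from transformers.cache_utils import DynamicCache

def adopt(cache, v4_cls):
    if isinstance(cache, v4_cls):
        return cache
    if type(cache) is DynamicCache:        # generation default
        cache.__class__ = v4_cls           # reinterpret in place, data intact
    else:                                  # StaticCache and friends
        cache.accumulate_windows = v4_cls.accumulate_windows.__get__(cache)
        cache.update_pool = v4_cls.update_pool.__get__(cache)
    return cache
```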

Other:
* DeepseekV4HashRouter inherits MixtralTopKRouter directly again (not
  V4TopKRouter — the chain breaks the modular converter).
* Remove the one-shot Indexer._pooled_kv helper; pool inline in forward.

Status: 112/121 tests pass. Remaining 9 are 3 TP tests that need real
multi-GPU (skipped locally on MPS), 2 torch.compile paths (precompiled-
header cache issue on this host), 2 left-padding numerical tests
(attention sink + compressor aren't exactly padding-invariant).
The two `shape > 0` checks in DeepseekV4Compressor.forward were paranoid:
PyTorch handles empty tensors cleanly through rotary application and the
indexer gather. Removed; the `cache.update_pool` path already short-
circuits when no window has closed.

Add `DeepseekV4ParityTest` with four tiny-config checks that exercise
the V4-specific pieces against from-scratch reference math:

* `test_compressor_pool_matches_reference` — re-derives the upstream
  `Compressor._pool` (softmax-gated sum with learned absolute position
  embedding) in-line and compares to `_pool_windows`.
* `test_compressor_cache_accumulates_across_calls` — feeds the same
  hidden states one token at a time vs. all at once; the running pool
  must be byte-identical. Covers the cache's window-buffer semantics
  (a logits-level sketch of this invariant follows these notes).
* `test_tiny_forward_is_deterministic_and_finite` — end-to-end smoke
  on a 10-token input, asserts shape / finiteness / determinism.
* `test_tiny_generate_runs` — greedy-generates 4 tokens on top of a
  6-token prompt, exercises the full generation loop (adopt cache,
  sliding-window K=V, compressor state, HC mixer, indexer gather).

Results: 112/121 CausalLMModelTest pass + 4/4 V4-specific parity tests.
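
A logits-level sketch of the incremental-vs-batch invariant that second parity test pins down (the real test compares the pooled state byte-for-byte; this analogue only checks final logits):

```python
import torch

def check_incremental_matches_batch(model, input_ids):
    # One token at a time through the cache must match a single full pass.
    full = model(input_ids).logits[:, -1]
    cache, out = None, None
    for t in range(input_ids.shape[1]):
        out = model(input_ids[:, t : t + 1], past_key_values=cache, use_cache=True)
        cache = out.past_key_values
    torch.testing.assert_close(out.logits[:, -1], full)
```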
@2020zyc commented Apr 24, 2026

Hello, thank you for the adaptation work. May I ask whether it is ready to use now, or when it is expected to be available? Looking forward to your reply.

"layers.*.mlp.shared_experts.up_proj": "colwise",
"layers.*.mlp.shared_experts.down_proj": "rowwise",
}
base_model_pp_plan = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base_ep_plan 🙈 🙈 or too soon ?

Comment thread src/transformers/models/deepseek_v4/modeling_deepseek_v4.py Outdated
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@ArthurZucker (Collaborator, Author) commented:
Had to take a small break, but the ETA is Monday / Tuesday.

@github-actions (Contributor) commented:

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v4

Six review comment threads on src/transformers/models/deepseek_v4/modular_deepseek_v4.py (five marked outdated).
ArthurZucker and others added 2 commits April 25, 2026 01:51
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
ArthurZucker mentioned this pull request Apr 25, 2026
@ArthurZucker (Collaborator, Author) commented:

Superseded by #45643 (same branch, hosted on origin).

@github-actions (Contributor) commented:

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45616&sha=9a4b9f
