
Add DeepSeek V4 #45616

Closed

ArthurZucker wants to merge 14 commits into huggingface:main from ArthurZucker:add-deepseek-v4

Conversation

@ArthurZucker (Collaborator) commented Apr 24, 2026

Draft moved to #45643

Initial modular implementation covering DeepSeek-V4-Flash/Pro and their
-Base siblings (all share the same architecture). New pieces vs V3.2:

* Sliding-window attention with a per-layer KV Compressor (learned gated
  pooling) and an Indexer selecting top-k compressed positions for
  long-range attention. No MLA.
* Hyper-Connections replace the residual stream (always on).
* Mixtral-style top-k MoE routing, no expert groups. First num_hash_layers
  layers route via a frozen tid2eid lookup keyed by input token ids.
* Per-head learnable attention sink; grouped low-rank output projection.

MTP weights in the checkpoint are ignored on load (added elsewhere).
Eager-only attention for now — SDPA/flash backends do not yet support
the sink term.
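
For readers unfamiliar with the sink mechanism, here is a minimal eager-attention sketch in the GPT-OSS style this PR ports. The function name and tensor layout are illustrative, not the PR's actual code:

```python
import torch
import torch.nn.functional as F

def eager_attention_with_sink(query, key, value, sinks, scaling, attention_mask=None):
    # query/key/value: [B, H, S, D]; sinks: learnable per-head logits [H].
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    # Append one extra "sink" column per head: softmax mass can drain into
    # a slot that no value vector backs.
    sink_col = sinks.view(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[2], 1)
    probs = F.softmax(torch.cat([attn_weights, sink_col], dim=-1), dim=-1, dtype=torch.float32)
    # Drop the sink column before mixing values; that discarded mass is the
    # part SDPA/flash kernels cannot currently express.
    return torch.matmul(probs[..., :-1].to(value.dtype), value)
```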
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* RoPE: drop custom embedding; use LlamaRotaryEmbedding. qk_rope_head_dim
  is honoured via rope_parameters['partial_rotary_factor'] which routes
  through the shared partial-aware init path. Main vs compressed rope
  bases built via a small helper (_build_rotary) at the Model level.
* RoPE apply: use apply_rotary_pos_emb_interleave from V3 for q/k rope
  slice (V4 reference uses interleaved-pair rotation via complex mul).
* Attention sink: port eager_attention_forward from GPT-OSS verbatim
  (renamed 'attn_sink' -> 'sinks' to match checkpoint/HF naming).
* SwiGLU clamp: match GPT-OSS clamp semantics on routed experts; shared
  expert is unclipped. Inlined into a forward override on DeepseekV4Experts
  to stay compatible with @use_experts_implementation.
* Compressor/Indexer statefulness: both are stateless now. State lives on
  a new DeepseekV4Cache(DynamicCache) — per-layer compressor_state and
  indexer_state dicts (buffer_kv, buffer_gate, pooled_kv). Window K/V
  continues through DynamicCache's DynamicSlidingWindowLayer.
* Remove _project_q / _project_kv helpers; fold into forward.
* Remove _score_fn; use ACT2FN via a tiny _resolve_activation wrapper
  that also understands 'sqrtsoftplus' (not in the global registry).
* HyperConnection: single module with a forward that wraps an inner
  callable and does pre-reduce -> inner -> post-expand. attn_hc and
  mlp_hc are now invoked through __call__.
* MLP: packed gate_up_proj; shared expert uses it too.
* Hash + top-k routers: unconditional norm_topk_prob normalisation
  (V4 ships with norm_topk_prob=True; dropped the conditional). A
  sketch of the renorm follows these notes.
* hc_head + final RMSNorm live on DeepseekV4Model, not ForCausalLM.
  Matches the standard transformers contract: Model returns
  [B, S, hidden], ForCausalLM only owns lm_head.

Tests (4) pass. ruff + check_config_attributes clean.
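
A minimal sketch of the unconditional top-k renormalisation mentioned in the router bullet above (illustrative only; the real routers also apply the scoring function and, for the top-k router, the correction bias):

```python
import torch

def topk_route(router_logits: torch.Tensor, top_k: int):
    # Unconditional norm_topk_prob: softmax over experts, keep top-k,
    # renormalise so the kept weights sum to 1 for every token.
    scores = router_logits.softmax(dim=-1, dtype=torch.float32)
    topk_weights, topk_idx = scores.topk(top_k, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_idx
```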
* Config inherits DeepseekV3Config; V3 MLA/group fields set to None
  and allow-listed in check_config_attributes; skip V3 __post_init__
  so V4's head_dim=512 is preserved.
* RMSNorm + RotaryEmbedding inherit V3 classes directly (no rebuild).
  Main + compress rotary built inline in Model by swapping
  rope_parameters on a copy.copy(config).
* Drop _SqrtSoftplus / _resolve_activation. Routers use ACT2FN where
  possible; sqrtsoftplus fallback is an inline F.softplus(x).sqrt().
* Drop _build_rotary helper.
* Cache: DeepseekV4SlidingLayer stores K=V once (no double-update);
  DeepseekV4Cache installs those layers + compressor/indexer state.
* DeepseekV4GroupedLinear is an nn.Linear subclass for the grouped
  low-rank output projection — quantizers keyed on .weight still see
  a valid (out, in) shape; forward does per-group bmm (sketched after
  this list).
* Remove module-level DeepseekV4Experts.forward monkey-patch; proper
  @use_experts_implementation class with clamp inline in forward.
* Shared expert inherits Qwen2MoeMLP (packed gate/up not used there —
  V3/Qwen2MoE convention).
* DeepseekV4TopKRouter / HashRouter: standalone, same weight+bias
  layout as V3, V4 scoring/renorm inline. Hash router's forward
  computes logits inline (no super() chain into V3 to survive
  modular conversion).
* HyperConnection: single module, forward(hidden_states, inner,
  layernorm, **kwargs); decoder layer calls attn_hc(...) and
  mlp_hc(...) directly — no _attn_inner / _mlp_inner callbacks.
* hc_head + final RMSNorm live on DeepseekV4Model.
* DeepseekV4ForCausalLM only defines __init__; forward is inherited
  from MixtralForCausalLM unchanged.
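
A hypothetical sketch of the grouped-linear trick from the list above, assuming a block-diagonal grouping (the actual V4 projection layout may differ):

```python
import torch
from torch import nn

class GroupedLinearSketch(nn.Linear):
    """Hypothetical stand-in for DeepseekV4GroupedLinear. Subclassing
    nn.Linear keeps a dense (out, in) .weight so quantizers keyed on
    that shape stay happy, while forward only touches the g diagonal
    blocks via a batched matmul."""

    def __init__(self, in_features, out_features, num_groups):
        super().__init__(in_features, out_features, bias=False)
        self.num_groups = num_groups

    def forward(self, x):
        g = self.num_groups
        out_g, in_g = self.out_features // g, self.in_features // g
        idx = torch.arange(g, device=x.device)
        # Pull the g diagonal (out_g, in_g) blocks out of the dense weight.
        w = self.weight.view(g, out_g, g, in_g)[idx, :, idx, :]
        xg = x.reshape(-1, g, in_g).transpose(0, 1)     # [g, N, in_g]
        yg = torch.bmm(xg, w.transpose(1, 2))           # [g, N, out_g]
        return yg.transpose(0, 1).reshape(*x.shape[:-1], self.out_features)
```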
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 24, 2026
The container image lacks native deepseek_v4 model type registration.
Install from huggingface/transformers#45616 (ArthurZucker/add-deepseek-v4)
to resolve the KeyError at config loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HyperConnection restructure:
* DeepseekV4HyperConnection now owns 'inner' (attn/mlp) and 'norm'
  (the per-site RMSNorm). Decoder-layer forward collapses to
      hidden_states = self.attn_hc(hidden_states, **kwargs)
      return self.mlp_hc(hidden_states, **kwargs)
  No more passing submodules as call arguments.
* DeepseekV4SparseMoeBlock.forward accepts **_ so the shared kwargs
  flow works for both attn and mlp sites.
* Hash router falls back to top-k over the learned gate weight when
  input_ids isn't threaded (inputs_embeds inference path).
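
A toy sketch of that fallback behaviour (class and buffer names are illustrative, and the uniform weights in the hash path are an assumption):

```python
import torch
from torch import nn

class HashRouterSketch(nn.Module):
    # Frozen token-id -> expert-id routing, with a learned-gate fallback
    # for the inputs_embeds path where token ids never arrive.
    def __init__(self, vocab_size, num_experts, hidden_size, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.register_buffer("tid2eid", torch.randint(num_experts, (vocab_size, top_k)))

    def forward(self, hidden_states, input_ids=None):
        if input_ids is not None:
            # Routing is a pure function of the token id.
            expert_idx = self.tid2eid[input_ids.reshape(-1)]
            weights = hidden_states.new_full(expert_idx.shape, 1.0 / self.top_k)
        else:
            # Fallback: score against the gate and renormalise the top-k.
            scores = self.gate(hidden_states.reshape(-1, hidden_states.shape[-1])).softmax(-1)
            weights, expert_idx = scores.topk(self.top_k, dim=-1)
            weights = weights / weights.sum(-1, keepdim=True)
        return weights, expert_idx
```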

Conversion mapping:
* New 'deepseek_v4' entry in src/transformers/conversion_mapping.py
  with four WeightRenaming rules mapping the standard decoder-layer
  names (self_attn, input_layernorm, mlp, post_attention_layernorm)
  onto the new HC-owned module tree ({attn,mlp}_hc.{inner,norm}).
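
The mapping itself is easy to state as plain string rewriting. This is a hypothetical standalone illustration, not the real WeightRenaming API:

```python
import re

# The four renames described above, as bare regex rules.
RENAMES = [
    (r"\.self_attn\.", ".attn_hc.inner."),
    (r"\.input_layernorm\.", ".attn_hc.norm."),
    (r"\.mlp\.", ".mlp_hc.inner."),
    (r"\.post_attention_layernorm\.", ".mlp_hc.norm."),
]

def rename_key(key: str) -> str:
    for pattern, replacement in RENAMES:
        key = re.sub(pattern, replacement, key)
    return key

assert (rename_key("model.layers.0.self_attn.q_proj.weight")
        == "model.layers.0.attn_hc.inner.q_proj.weight")
```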

Config + RoPE:
* rope_scaling removed (it's a property alias of rope_parameters on
  PreTrainedConfig; declaring it as a field made both mutate each
  other and broke to_dict roundtrips).
* partial_rotary_factor is a config field and is set to
  qk_rope_head_dim / head_dim when absent; this is the HF-standard
  mechanism for sizing cos/sin to the rope-only portion of each head
  (see the sketch after this list).
* DeepseekV4RotaryEmbedding overrides compute_default_rope_parameters
  to honour partial_rotary_factor on the default rope path as well.
* compress_rope_parameters derived from rope_parameters at __post_init__
  with rope_theta swapped.
* compress_ratios: accept either num_hidden_layers or +MTP length,
  truncate to num_hidden_layers.
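
A small worked example of the sizing mechanism (head_dim=512 comes from this PR; qk_rope_head_dim=64 is assumed for illustration):

```python
import torch

head_dim, qk_rope_head_dim, rope_theta = 512, 64, 10000.0
partial_rotary_factor = qk_rope_head_dim / head_dim   # 0.125

# The shared init path sizes inv_freq (and hence cos/sin) to the
# rope-only slice of each head rather than the full head_dim.
rotary_dim = int(head_dim * partial_rotary_factor)    # 64 == qk_rope_head_dim
inv_freq = 1.0 / rope_theta ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim)
assert inv_freq.numel() == qk_rope_head_dim // 2      # cos/sin built from this
```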

Stateless Compressor/Indexer:
* Both read/write state on the cache via getattr(..., state_key, None)
  so plain DynamicCache instances (generation default) work without
  crashing; stateful optimisation only kicks in with DeepseekV4Cache.
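
A minimal sketch of that defensive pattern, with hypothetical names:

```python
def pooled_kv_for_layer(cache, layer_idx, compute_fn):
    # Recompute from scratch on a plain DynamicCache; reuse and extend
    # the per-layer state when a DeepseekV4Cache is installed.
    state = getattr(cache, "compressor_state", None)
    if state is None:                  # plain DynamicCache: stateless path
        return compute_fn(prev=None)
    out = compute_fn(prev=state.get(layer_idx))
    state[layer_idx] = out             # stateful optimisation kicks in
    return out
```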

Tests:
* test_modeling_deepseek_v4.py now inherits CausalLMModelTest +
  CausalLMModelTester. Model-specific config attrs are declared on
  the tester class; get_config() threads them into DeepseekV4Config.
* Pipeline tests skipped (V4 has no ForSequenceClassification /
  ForTokenClassification / ForQuestionAnswering heads).
* 78 of 127 non-skipped tests pass; remaining failures are specific
  edge cases (gradient-checkpointing, torch.compile, rope scaling
  variants) to chase in follow-ups.
…/Experts

HyperConnection:
* Drop DeepseekV4HyperConnection module. HC is now three free helper
  functions (_hyper_connection_weights, _hyper_connection_collapse,
  _hyper_connection_expand) plus layer-level parameters on
  DeepseekV4DecoderLayer (hc_attn_*, hc_ffn_*), matching the upstream
  reference naming and keeping the decoder-layer forward readable:

      collapse → norm → self_attn → expand     (attention site)
      collapse → norm → mlp       → expand     (mlp site)

  Module tree matches the checkpoint's standard self_attn / mlp /
  input_layernorm / post_attention_layernorm; conversion_mapping entry
  dropped.
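
A simplified, static-weight sketch of one HC site under these helper names (the real helpers derive the mixing weights dynamically from the streams; the expand form follows the post·out + comb·streams shape noted in a later revision):

```python
import torch

def hc_site(streams, norm, inner, pre, post, comb, **kwargs):
    # streams: [B, n, S, D]; pre: [n]; post: [n]; comb: [n, n]
    x = torch.einsum("n,bnsd->bsd", pre, streams)        # collapse
    out = inner(norm(x), **kwargs)                       # norm -> attn/mlp
    return (torch.einsum("n,bsd->bnsd", post, out)       # expand
            + torch.einsum("nm,bmsd->bnsd", comb, streams))

# A decoder layer then runs two such sites back to back:
#   streams = hc_site(streams, input_layernorm, self_attn, ...)
#   streams = hc_site(streams, post_attention_layernorm, mlp, ...)
```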

Compressor / Indexer:
* Compressor MAY own an Indexer (only when compress_ratio == 4); the
  Indexer no longer owns a nested Compressor — it runs its own pooling
  inline at index_head_dim.
* Compressor.forward returns the final long-range KV segment for
  the layer (indexer-filtered if applicable); attention just does
  torch.cat without gather / topk logic of its own.

Experts:
* DeepseekV4Experts inherits GptOssExperts (packed gate_up_proj,
  per-expert loop, _apply_gate hook). V4's _apply_gate: chunk(2),
  clamp gate/up by swiglu_limit, SiLU * up. No biases.
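
A sketch of that _apply_gate hook (the one-sided gate clamp mirrors GPT-OSS semantics; treat the exact bounds as an assumption):

```python
import torch
import torch.nn.functional as F

def apply_gate(gate_up: torch.Tensor, limit: float) -> torch.Tensor:
    # Split the packed gate_up_proj output, clamp both halves by
    # swiglu_limit, then SiLU(gate) * up. No biases.
    gate, up = gate_up.chunk(2, dim=-1)
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up
```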

Routers:
* DeepseekV4TopKRouter inherits MixtralTopKRouter; adds the V4
  scoring_func (via ACT2FN) and the learnable noaux_tc correction bias
  buffer (not a Parameter — biases argmax only, no gradient path).
* DeepseekV4HashRouter inherits DeepseekV4TopKRouter, drops bias, adds
  the tid2eid lookup buffer. Raises cleanly when input_ids is missing
  (inputs_embeds path is unsupported for num_hash_layers > 0).
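
A minimal sketch of the correction-bias trick described above:

```python
import torch

def route_with_correction_bias(scores, correction_bias, top_k):
    # The bias shifts only which experts win the top-k (argmax path,
    # no gradient); the returned weights come from the unbiased scores.
    _, expert_idx = (scores + correction_bias).topk(top_k, dim=-1)
    weights = scores.gather(-1, expert_idx)
    return weights / weights.sum(-1, keepdim=True), expert_idx
```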

Activations:
* Add SqrtSoftplusActivation to the global ACT2FN registry so router
  scoring is a one-line ACT2FN[name] lookup with no local fallback.
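
The activation itself is a two-liner:

```python
import torch
import torch.nn.functional as F

class SqrtSoftplusActivation(torch.nn.Module):
    # sqrt(softplus(x)): softplus keeps the argument strictly positive,
    # so the square root is always well defined.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softplus(x).sqrt()
```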

Tests:
* Switch to CausalLMModelTester defaults via __init__ kwargs; force
  num_hidden_layers=2, compress_ratios=[0, 4], num_hash_layers=0 so the
  inputs_embeds generation tests in CausalLMModelTest run. Extra
  V4-specific tests (hash routing, compressor, attention sink) carried
  in separate methods.
* Override _check_past_key_values_for_generate to accept the
  sliding-window-truncated K/V shapes (every V4 layer is SWA).
* Override _check_attentions_for_generate / _check_hidden_states_for_generate
  to accept per-layer compressor KV expansion and the hc_mult stream axis.
* test_all_params_have_gradient = False — indexer params go through
  a non-differentiable argmax; the upstream recipe trains them through
  a separate objective.

Status: 109/122 tests pass, 13 known failures (TP on MPS [env], a few
numerical-match and compile-related generation tests).
* Compressor and Indexer are now pure math. All state accounting
  (per-layer pre-pool buffers, running pooled cache) is managed via
  two free helpers, _accumulate_windows and _update_pool, that live on
  the cache instance (DeepseekV4Cache or, defensively, any DynamicCache).
  Single cache update per call, mirroring past_key_values.update(k, v)
  semantics (see the sketch after this list).
* Compressor.forward always returns a tensor (empty shape [B, 1, 0, D]
  when no window has closed yet) — no more None code paths.
* Indexer no longer owns a nested Compressor; it pools inline through
  the same helpers with a distinct state_key. Only the Compressor owns
  the Indexer, never the other way around.
* Cache-type polymorphism: the helpers work on plain DynamicCache too
  (generation installs one by default), so V4 works with any
  Cache subclass without requiring our custom class.
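
A sketch of the window-buffer accounting these helpers implement (names and tensor layout illustrative):

```python
import torch

def accumulate_windows(buffer, new_tokens, window):
    # Stash incoming tokens on the cache; emit only fully closed windows
    # for pooling and keep the open remainder buffered.
    buffer = new_tokens if buffer is None else torch.cat([buffer, new_tokens], dim=-2)
    closed = (buffer.shape[-2] // window) * window
    return buffer[..., :closed, :], buffer[..., closed:, :]  # (to pool, to keep)
```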

* Inline the HC 'collapse' step in the decoder layer — it's a one-liner.
  Keep _hyper_connection_weights (shared mix-logit machinery) and
  _hyper_connection_expand (post·out + comb·streams) as helpers.
* Add ASCII diagrams to _hyper_connection_weights and
  DeepseekV4DecoderLayer explaining the HC pipeline vs the classic
  residual decoder layer.
* Add a block comment in DeepseekV4Attention.forward explaining *why*
  the output's rope slice is un-rotated (V shares with K in V4, so
  attention outputs carry position-entangled content on the rope dims;
  conjugate rotation at the query position pulls it back into a
  position-independent frame before the output projection).

HC parameters are cast to fp32 at use time for Sinkhorn stability.

Tests: 109/125 pass, 16 known failures (TP on MPS [env], torch.compile
paths, a numerical-match test sensitive to the attention sink under
padding).
RoPE:
* Use Llama's standard apply_rotary_pos_emb (rotate_half + cat(freqs, freqs))
  instead of V3's apply_rotary_pos_emb_interleave, which did a rearrange-
  then-rotate round trip.
* DeepseekV4RotaryEmbedding inherits DeepseekV3RotaryEmbedding and only
  overrides compute_default_rope_parameters to honour partial_rotary_factor
  so cos/sin comes out sized to qk_rope_head_dim.
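
For reference, the Llama-style apply being switched to (abridged from transformers):

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # cos/sin already arrive as cat(freqs, freqs) along the last dim,
    # so one fused multiply covers both halves of each head.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```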

Hyper-Connections:
* DeepseekV4HyperConnection is now a proper nn.Module owning (fn, base,
  scale). Each decoder layer has two instances (attn_hc, ffn_hc) and calls
  .compute_weights(hidden_streams) -> (pre, post, comb) on each site.
* The stream collapse and expand math is inlined in the decoder layer —
  two lines each — with matching ASCII diagrams on the class docstring.
* Checkpoint keys (hc_attn_{fn,base,scale}, hc_ffn_{fn,base,scale}) are
  bridged to attn_hc.* / ffn_hc.* via conversion_mapping.py.

Cache:
* accumulate_windows / update_pool are methods on DeepseekV4Cache.
* DeepseekV4Cache.adopt coerces incoming caches: DynamicCache (generation
  default) gets its class reinterpreted in place; StaticCache and friends
  get the methods bolted on. The state store is created lazily.
* Ephemeral adopt at the Attention boundary handles the grad-checkpoint
  pass where past_key_values is stripped.
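
A simplified sketch of adopt (the real method also creates the state store lazily):

```python
from transformers.cache_utils import DynamicCache

def adopt(cache, v4_cls):
    if isinstance(cache, v4_cls):
        return cache
    if type(cache) is DynamicCache:        # generation default
        cache.__class__ = v4_cls           # reinterpret in place, data intact
    else:                                  # StaticCache and friends
        cache.accumulate_windows = v4_cls.accumulate_windows.__get__(cache)
        cache.update_pool = v4_cls.update_pool.__get__(cache)
    return cache
```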

Other:
* DeepseekV4HashRouter inherits MixtralTopKRouter directly again (not
  V4TopKRouter — the chain breaks the modular converter).
* Remove the one-shot Indexer._pooled_kv helper; pool inline in forward.

Status: 112/121 tests pass. Remaining 9 are 3 TP tests that need real
multi-GPU (skipped locally on MPS), 2 torch.compile paths (precompiled-
header cache issue on this host), 2 left-padding numerical tests
(attention sink + compressor aren't exactly padding-invariant).
The two `shape > 0` checks in DeepseekV4Compressor.forward were paranoid:
PyTorch handles empty tensors cleanly through rotary application and the
indexer gather. Removed; the `cache.update_pool` path already short-
circuits when no window has closed.

Add `DeepseekV4ParityTest` with four tiny-config checks that exercise
the V4-specific pieces against from-scratch reference math:

* `test_compressor_pool_matches_reference` — re-derives the upstream
  `Compressor._pool` (softmax-gated sum with learned absolute position
  embedding) in-line and compares to `_pool_windows`.
* `test_compressor_cache_accumulates_across_calls` — feeds the same
  hidden states one token at a time vs. all at once; the running pool
  must be byte-identical. Covers the cache's window-buffer semantics
  (a logits-level sketch of this invariant follows these notes).
* `test_tiny_forward_is_deterministic_and_finite` — end-to-end smoke
  on a 10-token input, asserts shape / finiteness / determinism.
* `test_tiny_generate_runs` — greedy-generates 4 tokens on top of a
  6-token prompt, exercises the full generation loop (adopt cache,
  sliding-window K=V, compressor state, HC mixer, indexer gather).

Results: 112/121 CausalLMModelTest pass + 4/4 V4-specific parity tests.
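
A logits-level sketch of the incremental-vs-batch invariant that second parity test pins down (the real test compares the pooled state byte-for-byte; this analogue only checks final logits):

```python
import torch

def check_incremental_matches_batch(model, input_ids):
    # One token at a time through the cache must match a single full pass.
    full = model(input_ids).logits[:, -1]
    cache, out = None, None
    for t in range(input_ids.shape[1]):
        out = model(input_ids[:, t : t + 1], past_key_values=cache, use_cache=True)
        cache = out.past_key_values
    torch.testing.assert_close(out.logits[:, -1], full)
```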
@2020zyc commented Apr 24, 2026

Hello, thank you for the adaptation work. May I ask whether it is ready to use now, or when it is expected to be available? Looking forward to your reply.

"layers.*.mlp.shared_experts.up_proj": "colwise",
"layers.*.mlp.shared_experts.down_proj": "rowwise",
}
base_model_pp_plan = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base_ep_plan 🙈 🙈 or too soon ?

Comment thread src/transformers/models/deepseek_v4/modeling_deepseek_v4.py Outdated
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@ArthurZucker (Collaborator, Author) commented:
Had to take a small break, but the ETA is Monday / Tuesday.

@github-actions (Contributor) commented:

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v4

Six review comment threads on src/transformers/models/deepseek_v4/modular_deepseek_v4.py (five marked outdated).
ArthurZucker and others added 2 commits April 25, 2026 01:51
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
ArthurZucker mentioned this pull request Apr 25, 2026
@ArthurZucker (Collaborator, Author) commented:

Superseded by #45643 (same branch, hosted on origin).

@github-actions (Contributor) commented:

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45616&sha=9a4b9f
