Add DeepSeek V4 #45616
Closed
ArthurZucker wants to merge 14 commits into huggingface:main from ArthurZucker:add-deepseek-v4
Conversation
Initial modular implementation covering DeepSeek-V4-Flash/Pro and their -Base siblings (all share the same architecture). New pieces vs V3.2:
* Sliding-window attention with a per-layer KV Compressor (learned gated pooling) and an Indexer selecting top-k compressed positions for long-range attention. No MLA.
* Hyper-Connections replace the residual stream (always on).
* Mixtral-style top-k MoE routing, no expert groups. The first num_hash_layers layers route via a frozen tid2eid lookup keyed by input token ids (see the sketch after this list).
* Per-head learnable attention sink; grouped low-rank output projection.
* MTP weights in the checkpoint are ignored on load (added elsewhere).
* Eager-only attention for now; SDPA/flash backends do not yet support the sink term.
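The hash-routed layers can be pictured with a tiny stand-alone sketch. This is not the PR's code: the table name tid2eid comes from the description above, and the random table plus uniform routing weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HashRouterSketch(nn.Module):
    """Illustrative only: route each token to experts via a frozen
    token-id -> expert-id lookup table, as in the first num_hash_layers
    layers. The random table and uniform weights are assumptions."""

    def __init__(self, vocab_size: int, num_experts: int, top_k: int, seed: int = 0):
        super().__init__()
        generator = torch.Generator().manual_seed(seed)
        # Frozen lookup: one pre-assigned expert per (token id, slot) pair.
        tid2eid = torch.randint(0, num_experts, (vocab_size, top_k), generator=generator)
        self.register_buffer("tid2eid", tid2eid, persistent=True)
        self.top_k = top_k

    def forward(self, input_ids: torch.LongTensor):
        # [batch, seq] -> [batch * seq, top_k] expert ids, uniform weights.
        expert_ids = self.tid2eid[input_ids.reshape(-1)]
        weights = torch.full(expert_ids.shape, 1.0 / self.top_k)
        return expert_ids, weights

ids = torch.randint(0, 1000, (2, 8))
experts, w = HashRouterSketch(vocab_size=1000, num_experts=16, top_k=2)(ids)
```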
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
* RoPE: drop the custom embedding; use LlamaRotaryEmbedding. qk_rope_head_dim is honoured via rope_parameters['partial_rotary_factor'], which routes through the shared partial-aware init path. Main vs compressed rope bases are built via a small helper (_build_rotary) at the Model level.
* RoPE apply: use apply_rotary_pos_emb_interleave from V3 for the q/k rope slice (the V4 reference uses interleaved-pair rotation via complex multiplication).
* Attention sink: port eager_attention_forward from GPT-OSS verbatim, renaming 'attn_sink' to 'sinks' to match checkpoint/HF naming (a hedged sketch of the sink softmax follows below).
* SwiGLU clamp: match GPT-OSS clamp semantics on routed experts; the shared expert is unclipped. Inlined into a forward override on DeepseekV4Experts to stay compatible with @use_experts_implementation.
* Compressor/Indexer statefulness: both are stateless now. State lives on a new DeepseekV4Cache(DynamicCache): per-layer compressor_state and indexer_state dicts (buffer_kv, buffer_gate, pooled_kv). Window K/V continues through DynamicCache's DynamicSlidingWindowLayer.
* Remove the _project_q / _project_kv helpers; fold them into forward.
* Remove _score_fn; use ACT2FN via a tiny _resolve_activation wrapper that also understands 'sqrtsoftplus' (not in the global registry).
* HyperConnection: single module with a forward that wraps an inner callable and does pre-reduce -> inner -> post-expand. attn_hc and mlp_hc are now invoked through __call__.
* MLP: packed gate_up_proj; the shared expert uses it too.
* Hash + top-k routers: unconditional norm_topk_prob normalisation (V4 ships with norm_topk_prob=True; dropped the conditional).
* hc_head + the final RMSNorm live on DeepseekV4Model, not ForCausalLM. This matches the standard transformers contract: Model returns [B, S, hidden]; ForCausalLM only owns lm_head.

Tests (4) pass. ruff + check_config_attributes clean.
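For the sink term, here is a minimal sketch of the GPT-OSS-style softmax augmentation the PR ports (shapes and the stabilisation step are assumptions; the PR reuses eager_attention_forward verbatim):

```python
import torch
import torch.nn.functional as F

def sink_softmax_sketch(attn_logits: torch.Tensor, sinks: torch.Tensor) -> torch.Tensor:
    """Hedged sketch: attn_logits is assumed [batch, heads, q_len, kv_len] and
    sinks a learnable per-head scalar [heads]. The sink acts as one extra
    "attend to nothing" slot that absorbs probability mass; it is dropped
    after the softmax."""
    b, h, q, _ = attn_logits.shape
    sink_col = sinks.view(1, h, 1, 1).expand(b, h, q, 1)
    logits = torch.cat([attn_logits, sink_col], dim=-1)
    logits = logits - logits.max(dim=-1, keepdim=True).values  # numerical stability
    probs = F.softmax(logits, dim=-1)
    return probs[..., :-1]  # drop the sink column
```

Because the sink column is dropped after the softmax, each row's attention weights sum to at most 1, which is why generic SDPA/flash kernels cannot express it directly.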
* Config inherits DeepseekV3Config; the V3 MLA/group fields are set to None and allow-listed in check_config_attributes; V3's __post_init__ is skipped so V4's head_dim=512 is preserved.
* RMSNorm + RotaryEmbedding inherit the V3 classes directly (no rebuild). Main + compress rotary are built inline in Model by swapping rope_parameters on a copy.copy(config).
* Drop _SqrtSoftplus / _resolve_activation. Routers use ACT2FN where possible; the sqrtsoftplus fallback is an inline F.softplus(x).sqrt().
* Drop the _build_rotary helper.
* Cache: DeepseekV4SlidingLayer stores K=V once (no double update); DeepseekV4Cache installs those layers plus the compressor/indexer state.
* DeepseekV4GroupedLinear is an nn.Linear subclass for the grouped low-rank output projection, so quantizers keyed on .weight still see a valid (out, in) shape; forward does a per-group bmm (an illustrative sketch follows below).
* Remove the module-level DeepseekV4Experts.forward monkey-patch; it is now a proper @use_experts_implementation class with the clamp inline in forward.
* The shared expert inherits Qwen2MoeMLP (packed gate/up is not used there, following the V3/Qwen2MoE convention).
* DeepseekV4TopKRouter / HashRouter: standalone, same weight+bias layout as V3, with the V4 scoring/renorm inline. The hash router's forward computes logits inline (no super() chain into V3, to survive modular conversion).
* HyperConnection: single module, forward(hidden_states, inner, layernorm, **kwargs); the decoder layer calls attn_hc(...) and mlp_hc(...) directly, with no _attn_inner / _mlp_inner callbacks.
* hc_head + the final RMSNorm live on DeepseekV4Model.
* DeepseekV4ForCausalLM only defines __init__; forward is inherited from MixtralForCausalLM unchanged.
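The grouped projection idea, roughly (the block-diagonal layout and group bookkeeping here are assumptions for illustration, not the checkpoint's actual layout):

```python
import torch
import torch.nn as nn

class GroupedLinearSketch(nn.Linear):
    """Illustrative only: the weight keeps the usual (out_features, in_features)
    shape, so tooling keyed on .weight sees a normal Linear, but forward applies
    it as `groups` independent blocks: block g maps input slice g to output
    slice g via a batched matmul."""

    def __init__(self, in_features: int, out_features: int, groups: int):
        assert in_features % groups == 0 and out_features % groups == 0
        super().__init__(in_features, out_features, bias=False)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.groups
        in_g, out_g = self.in_features // g, self.out_features // g
        lead = x.shape[:-1]
        xg = x.reshape(-1, g, in_g).transpose(0, 1)                    # [g, N, in_g]
        # Diagonal blocks of the (out, in) matrix: [g, out_g, in_g]
        wg = self.weight.view(g, out_g, g, in_g)[torch.arange(g), :, torch.arange(g)]
        y = torch.bmm(xg, wg.transpose(1, 2)).transpose(0, 1)         # [N, g, out_g]
        return y.reshape(*lead, self.out_features)
```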
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request on Apr 24, 2026
The container image lacks native deepseek_v4 model type registration. Install from huggingface/transformers#45616 (ArthurZucker/add-deepseek-v4) to resolve the KeyError at config loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HyperConnection restructure:
* DeepseekV4HyperConnection now owns 'inner' (attn/mlp) and 'norm'
(the per-site RMSNorm). Decoder-layer forward collapses to
hidden_states = self.attn_hc(hidden_states, **kwargs)
return self.mlp_hc(hidden_states, **kwargs)
No more passing submodules as call arguments (a minimal sketch of the wrapper shape follows after this block).
* DeepseekV4SparseMoeBlock.forward accepts **_ so the shared kwargs
flow works for both attn and mlp sites.
* Hash router falls back to top-k over the learned gate weight when
input_ids isn't threaded (inputs_embeds inference path).
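A simplified picture of that wrapper shape (module and parameter names are assumptions, and the static mixing weights here are a simplification; the real module derives its mixing weights from the hidden streams rather than holding them as fixed parameters):

```python
import torch
import torch.nn as nn

class HyperConnectionSketch(nn.Module):
    """Hedged sketch: the module owns the inner block (attention or MLP) and its
    norm; hidden_states carries an extra stream axis of size n_streams.
    `pre` collapses the streams into one layer input, `post`/`comb` expand the
    layer output back onto the streams."""

    def __init__(self, inner: nn.Module, norm: nn.Module, n_streams: int):
        super().__init__()
        self.inner, self.norm = inner, norm
        self.pre = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        self.post = nn.Parameter(torch.ones(n_streams))
        self.comb = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams: torch.Tensor, **kwargs) -> torch.Tensor:
        # streams: [batch, n_streams, seq, hidden]
        x = torch.einsum("n,bnsh->bsh", self.pre, streams)        # collapse
        out = self.inner(self.norm(x), **kwargs)
        if isinstance(out, tuple):                                 # attention returns a tuple
            out = out[0]
        return (torch.einsum("nm,bmsh->bnsh", self.comb, streams)
                + self.post.view(1, -1, 1, 1) * out.unsqueeze(1))  # expand
```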
Conversion mapping:
* New 'deepseek_v4' entry in src/transformers/conversion_mapping.py
with four WeightRenaming rules mapping the standard decoder-layer
names (self_attn, input_layernorm, mlp, post_attention_layernorm)
onto the new HC-owned module tree ({attn,mlp}_hc.{inner,norm}).
Config + RoPE:
* rope_scaling removed (it's a property alias of rope_parameters on
PreTrainedConfig; declaring it as a field made both mutate each
other and broke to_dict roundtrips).
* partial_rotary_factor is a config field and is set to
qk_rope_head_dim / head_dim when absent; this is the HF-standard
mechanism for sizing cos/sin to the rope-only portion of each head
(a worked example follows after this block).
* DeepseekV4RotaryEmbedding overrides compute_default_rope_parameters
to honour partial_rotary_factor on the default rope path as well.
* compress_rope_parameters derived from rope_parameters at __post_init__
with rope_theta swapped.
* compress_ratios: accept a list of length num_hidden_layers or
num_hidden_layers + MTP depth; truncate to num_hidden_layers.
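Worked numbers for the sizing rule (head_dim=512 is from the PR; qk_rope_head_dim=64 is an assumed example value):

```python
# Hedged illustration of the partial_rotary_factor sizing described above.
head_dim = 512            # full attention head dimension in V4
qk_rope_head_dim = 64     # assumed example: only this slice of each head is rotated
partial_rotary_factor = qk_rope_head_dim / head_dim   # 0.125

# Inside the shared rotary init path, the rope dimension then becomes:
rotary_dim = int(head_dim * partial_rotary_factor)    # == qk_rope_head_dim
# cos/sin are built for rotary_dim channels only; the remaining
# head_dim - rotary_dim channels pass through attention un-rotated.
```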
Stateless Compressor/Indexer:
* Both read/write state on the cache via getattr(..., state_key, None)
so plain DynamicCache instances (generation default) work without
crashing; stateful optimisation only kicks in with DeepseekV4Cache.
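The lookup pattern, sketched (attribute and key names are assumptions):

```python
def get_layer_state(cache, state_key: str, layer_idx: int):
    """Hedged sketch: return per-layer compressor/indexer state if the cache
    carries it, else None so a plain DynamicCache still works (the layer just
    recomputes from scratch instead of reusing buffered windows)."""
    store = getattr(cache, state_key, None)        # e.g. "compressor_state"
    if store is None:
        return None                                # plain DynamicCache path
    return store.setdefault(layer_idx, {})         # DeepseekV4Cache path
```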
Tests:
* test_modeling_deepseek_v4.py now inherits CausalLMModelTest +
CausalLMModelTester. Model-specific config attrs are declared on
the tester class; get_config() threads them into DeepseekV4Config.
* Pipeline tests skipped (V4 has no ForSequenceClassification /
ForTokenClassification / ForQuestionAnswering heads).
* 78 of 127 non-skipped tests pass; remaining failures are specific
edge cases (gradient-checkpointing, torch.compile, rope scaling
variants) to chase in follow-ups.
…/Experts
HyperConnection:
* Drop DeepseekV4HyperConnection module. HC is now three free helper
functions (_hyper_connection_weights, _hyper_connection_collapse,
_hyper_connection_expand) plus layer-level parameters on
DeepseekV4DecoderLayer (hc_attn_*, hc_ffn_*), matching the upstream
reference naming and keeping the decoder-layer forward readable:
collapse → norm → self_attn → expand (attention site)
collapse → norm → mlp → expand (mlp site)
Module tree matches the checkpoint's standard self_attn / mlp /
input_layernorm / post_attention_layernorm; conversion_mapping entry
dropped.
Compressor / Indexer:
* Compressor MAY own an Indexer (only when compress_ratio == 4); the
Indexer no longer owns a nested Compressor — it runs its own pooling
inline at index_head_dim.
* Compressor.forward returns the final long-range KV segment for
the layer (indexer-filtered if applicable); attention just does
torch.cat without gather / topk logic of its own.
Experts:
* DeepseekV4Experts inherits GptOssExperts (packed gate_up_proj,
per-expert loop, _apply_gate hook). V4's _apply_gate: chunk(2),
clamp gate/up by swiglu_limit, SiLU * up. No biases.
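A sketch of that gating step (the exact clamp bounds are an assumption):

```python
import torch
import torch.nn.functional as F

def apply_gate_sketch(gate_up: torch.Tensor, swiglu_limit: float) -> torch.Tensor:
    """Hedged sketch of _apply_gate as described above: split the packed
    projection, clamp both halves by swiglu_limit, then SiLU(gate) * up.
    No biases."""
    gate, up = gate_up.chunk(2, dim=-1)
    gate = gate.clamp(min=-swiglu_limit, max=swiglu_limit)
    up = up.clamp(min=-swiglu_limit, max=swiglu_limit)
    return F.silu(gate) * up
```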
Routers:
* DeepseekV4TopKRouter inherits MixtralTopKRouter; adds the V4
scoring_func (via ACT2FN) and the learnable noaux_tc correction bias
buffer (not a Parameter: it biases the argmax only, with no gradient
path; the selection mechanics are sketched after this block).
* DeepseekV4HashRouter inherits DeepseekV4TopKRouter, drops bias, adds
the tid2eid lookup buffer. Raises cleanly when input_ids is missing
(inputs_embeds path is unsupported for num_hash_layers > 0).
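The correction-bias selection trick, sketched from the description above (not the PR's code):

```python
import torch

def route_with_correction_bias_sketch(scores: torch.Tensor,
                                      correction_bias: torch.Tensor,
                                      top_k: int):
    """Hedged sketch: the bias shifts which experts win the top-k (selection
    only), while the gating weights come from the unbiased scores, so no
    gradient flows through the bias. scores: [tokens, num_experts];
    correction_bias: [num_experts]."""
    _, topk_idx = (scores + correction_bias).topk(top_k, dim=-1)       # biased selection
    topk_weights = scores.gather(-1, topk_idx)                         # unbiased weights
    topk_weights = topk_weights / topk_weights.sum(-1, keepdim=True)   # norm_topk_prob
    return topk_idx, topk_weights
```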
Activations:
* Add SqrtSoftplusActivation to the global ACT2FN registry so router
scoring is a one-line ACT2FN[name] lookup with no local fallback.
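The activation itself is one line; a hedged sketch of the module shape (the actual registration into ACT2FN happens in transformers' activations module and is not shown here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqrtSoftplusSketch(nn.Module):
    """Illustrative module computing sqrt(softplus(x))."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softplus(x).sqrt()
```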
Tests:
* Switch to CausalLMModelTester defaults via __init__ kwargs; force
num_hidden_layers=2, compress_ratios=[0, 4], num_hash_layers=0 so the
inputs_embeds generation tests in CausalLMModelTest run. Extra
V4-specific tests (hash routing, compressor, attention sink) carried
in separate methods.
* Override _check_past_key_values_for_generate to accept the
sliding-window-truncated K/V shapes (every V4 layer is SWA).
* Override _check_attentions_for_generate / _check_hidden_states_for_generate
to accept per-layer compressor KV expansion and the hc_mult stream axis.
* test_all_params_have_gradient = False — indexer params go through
a non-differentiable argmax; the upstream recipe trains them through
a separate objective.
Status: 109/122 tests pass, 13 known failures (TP on MPS [env], a few
numerical-match and compile-related generation tests).
* Compressor and Indexer are now pure math. All state accounting (per-layer pre-pool buffers, running pooled cache) is managed via two free helpers, _accumulate_windows and _update_pool, that live on the cache instance (DeepseekV4Cache or, defensively, any DynamicCache). Single cache update per call, mirroring past_key_values.update(k, v) semantics (a rough sketch of the window bookkeeping follows below).
* Compressor.forward always returns a tensor (empty shape [B, 1, 0, D] when no window has closed yet); no more None code paths.
* The Indexer no longer owns a nested Compressor; it pools inline through the same helpers with a distinct state_key. Only the Compressor owns the Indexer, never the other way around.
* Cache-type polymorphism: the helpers work on a plain DynamicCache too (generation installs one by default), so V4 works with any Cache subclass without requiring our custom class.
* Inline the HC 'collapse' step in the decoder layer, since it's a one-liner. Keep _hyper_connection_weights (shared mix-logit machinery) and _hyper_connection_expand (post·out + comb·streams) as helpers.
* Add ASCII diagrams to _hyper_connection_weights and DeepseekV4DecoderLayer explaining the HC pipeline vs the classic residual decoder layer.
* Add a block comment in DeepseekV4Attention.forward explaining *why* the output's rope slice is un-rotated (V is shared with K in V4, so attention outputs carry position-entangled content on the rope dims; a conjugate rotation at the query position pulls it back into a position-independent frame before the output projection).
* HC parameters are cast to fp32 at use time for Sinkhorn stability.

Tests: 109/125 pass, 16 known failures (TP on MPS [env], torch.compile paths, a numerical-match test sensitive to the attention sink under padding).
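The window bookkeeping behind _accumulate_windows can be pictured like this (a hedged sketch; names, shapes, and the exact buffer layout are assumptions):

```python
import torch

def accumulate_windows_sketch(buffer: torch.Tensor, new_kv: torch.Tensor, window: int):
    """Hedged sketch: buffer holds the currently open window [B, H, t, D] with
    t < window; new_kv is the incoming segment. Returns (closed_windows,
    new_buffer), where closed_windows is [B, H, n_closed, window, D] ready to be
    pooled (n_closed may be 0 when no window has closed yet)."""
    kv = torch.cat([buffer, new_kv], dim=2)
    n_closed = kv.shape[2] // window
    b, h, _, d = kv.shape
    closed = kv[:, :, : n_closed * window].reshape(b, h, n_closed, window, d)
    return closed, kv[:, :, n_closed * window :]
```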
RoPE:
* Use Llama's standard apply_rotary_pos_emb (rotate_half + cat(freqs, freqs))
instead of V3's apply_rotary_pos_emb_interleave, which did a rearrange-
then-rotate round trip (see the pairing sketch after this block).
* DeepseekV4RotaryEmbedding inherits DeepseekV3RotaryEmbedding and only
overrides compute_default_rope_parameters to honour partial_rotary_factor
so cos/sin comes out sized to qk_rope_head_dim.
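For reference, the pairing difference boils down to this (paraphrasing the standard Llama-style helpers; the unsqueeze/broadcast bookkeeping is omitted):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Llama-style pairing: channel i rotates with channel i + d/2,
    # as opposed to the interleaved (2i, 2i+1) pairing used by
    # apply_rotary_pos_emb_interleave.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # cos/sin are cat(freqs, freqs) along the last dim, so this is the
    # complex rotation written out in real arithmetic.
    return q * cos + rotate_half(q) * sin
```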
Hyper-Connections:
* DeepseekV4HyperConnection is now a proper nn.Module owning (fn, base,
scale). Each decoder layer has two instances (attn_hc, ffn_hc) and calls
.compute_weights(hidden_streams) -> (pre, post, comb) on each site.
* The stream collapse and expand math is inlined in the decoder layer —
two lines each — with matching ASCII diagrams on the class docstring.
* Checkpoint keys (hc_attn_{fn,base,scale}, hc_ffn_{fn,base,scale}) are
bridged to attn_hc.* / ffn_hc.* via conversion_mapping.py.
Cache:
* accumulate_windows / update_pool are methods on DeepseekV4Cache.
* DeepseekV4Cache.adopt coerces incoming caches: DynamicCache (generation
default) gets its class reinterpreted in place; StaticCache and friends
get the methods bolted on. The state store is created lazily
(adoption sketched after this block).
* Ephemeral adopt at the Attention boundary handles the grad-checkpoint
pass where past_key_values is stripped.
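The adoption pattern, sketched (method names are assumptions; the StaticCache bolt-on path is not shown):

```python
from transformers.cache_utils import DynamicCache

class DeepseekV4CacheSketch(DynamicCache):
    """Hedged sketch: a plain DynamicCache created by generate() is
    reinterpreted in place; the compressor/indexer state store is created
    lazily on first use."""

    @classmethod
    def adopt(cls, cache):
        if type(cache) is DynamicCache:
            cache.__class__ = cls          # reinterpret in place, keep existing K/V
        return cache

    @property
    def state_store(self):
        if not hasattr(self, "_state_store"):
            self._state_store = {}         # lazy per-layer compressor/indexer state
        return self._state_store
```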
Other:
* DeepseekV4HashRouter inherits MixtralTopKRouter directly again (not
V4TopKRouter — the chain breaks the modular converter).
* Remove the one-shot Indexer._pooled_kv helper; pool inline in forward.
Status: 112/121 tests pass. Remaining 9 are 3 TP tests that need real
multi-GPU (skipped locally on MPS), 2 torch.compile paths (precompiled-
header cache issue on this host), 2 left-padding numerical tests
(attention sink + compressor aren't exactly padding-invariant).
The two `shape > 0` checks in DeepseekV4Compressor.forward were paranoid: PyTorch handles empty tensors cleanly through rotary application and the indexer gather. Removed; the `cache.update_pool` path already short-circuits when no window has closed.

Add `DeepseekV4ParityTest` with four tiny-config checks that exercise the V4-specific pieces against from-scratch reference math:
* `test_compressor_pool_matches_reference`: re-derives the upstream `Compressor._pool` (softmax-gated sum with a learned absolute position embedding) inline and compares it to `_pool_windows`.
* `test_compressor_cache_accumulates_across_calls`: feeds the same hidden states one token at a time vs. all at once; the running pool must be byte-identical. Covers the cache's window-buffer semantics.
* `test_tiny_forward_is_deterministic_and_finite`: end-to-end smoke on a 10-token input, asserting shape / finiteness / determinism (sketched below).
* `test_tiny_generate_runs`: greedy-generates 4 tokens on top of a 6-token prompt, exercising the full generation loop (adopt cache, sliding-window K=V, compressor state, HC mixer, indexer gather).

Results: 112/121 CausalLMModelTest pass + 4/4 V4-specific parity tests.
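A sketch of what the determinism/finiteness smoke check boils down to (not the actual test body):

```python
import torch

def tiny_forward_smoke_sketch(model, vocab_size: int = 128):
    """Hedged sketch: one forward on a 10-token input, assert output shape,
    finiteness, and determinism across two eval-mode runs."""
    model.eval()
    input_ids = torch.randint(0, vocab_size, (1, 10))
    with torch.no_grad():
        out1 = model(input_ids).logits
        out2 = model(input_ids).logits
    assert out1.shape[:2] == (1, 10)
    assert torch.isfinite(out1).all()
    assert torch.equal(out1, out2)
```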
Hello, thank you for your adaptation work. May I ask if it is ready to use now? Or when is it expected to be available? Looking forward to your reply.
AmineDiro reviewed on Apr 24, 2026
| "layers.*.mlp.shared_experts.up_proj": "colwise", | ||
| "layers.*.mlp.shared_experts.down_proj": "rowwise", | ||
| } | ||
| base_model_pp_plan = { |
Member
base_ep_plan 🙈 🙈 or too soon ?
qgallouedec reviewed on Apr 24, 2026
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Collaborator
Author
Had to take a small break but the ETA is Monday / Tuesday.
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, deepseek_v4
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
…s into add-deepseek-v4
Draft
Collaborator
Author
Superseded by #45643 (same branch, hosted on origin).
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45616&sha=9a4b9f
Draft moved to #45643