Add DeepSeek V4 #45643
Outputs are valid now
… Phase 1 config + runner
Three coupled changes:
1) discovery/perf.py — harden per Rocky's notes (2026-04-25) on
pytorch/benchmarks/dynamo/common.py:
- patch_torch_manual_seed(seed=1337) — call once at process start;
monkey-patches torch.manual_seed so HF models' internal RNG calls
don't drift between runs (per Animesh on HF model non-determinism).
- eager_self_check — runs forward twice with cloned identical inputs;
reports max_abs_diff + deterministic bool. Detects models still
non-deterministic even with the seed patch.
- warm_peak_mem flag — captures both cold (default) and post-warmup
peak memory. Don't conflate the two.
- compile_times — captures torch._dynamo.utils.compile_times() dict
(22+ metrics: _compile.compile_inner, GraphLowering.run, etc.) for
cross-comparable compile-time analysis vs upstream HF dashboard.
- methodology comments updated to reference common.py line numbers.
2) experiments/configs/deepseek-v4-pro-phase1.json — config for the
Phase 1 eval. Scaled-but-architecturally-complete: ALL V4 features
active at production dims (head_dim=512, q_lora_rank=1536,
num_hash_layers=3, index_n_heads=64, hc_mult=4, hybrid attention, MLA,
etc.); only num_hidden_layers (61->4), n_routed_experts (384->16), and
vocab_size (129280->4096) scaled to fit 1x H100 in bf16. Pins the
transformers PR branch sha (huggingface/transformers#45643 @ a0a8482).
3) experiments/scripts/run_deepseek_v4_pro_phase1.py — self-contained
runner. Reads the config, applies seed patch + TF32 high precision,
instantiates the model, and runs 4 dimensions in sequence:
Step 1: instantiate + eager forward (param count, peak mem)
Step 2: torch._dynamo.explain (graph break analysis)
Step 3: correctness vs eager (max_abs_diff + bitwise_equal)
Step 4: tier-1 perf via measure_perf (eager_ms / compiled_ms /
speedup / compile_s + compile_times breakdown)
Writes per-row results to experiments/results/deepseek_v4_pro/
phase1-tiny-<datestamp>/results.json. Top-level torch.compile.
Phase 1 eval not yet executed — runner is ready; smoke-tested perf.py
upgrade. See experiments/deepseek_v4_pro_eval_plan.md.
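For reference, a rough sketch of what the eager self-check described in (1) could look like; the signature and return shape here are assumptions, not the actual perf.py API:

import torch

def eager_self_check(model, inputs, atol=0.0):
    # Hypothetical helper mirroring the description above: run the eager forward twice
    # on cloned, identical inputs and report whether the model is deterministic
    # (catches models that still drift even after the manual-seed patch).
    model.eval()
    with torch.no_grad():
        out1 = model(**{k: v.clone() for k, v in inputs.items()}).logits
        out2 = model(**{k: v.clone() for k, v in inputs.items()}).logits
    max_abs_diff = (out1 - out2).abs().max().item()
    return {"max_abs_diff": max_abs_diff, "deterministic": max_abs_diff <= atol}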
force-pushed from a79ed83 to b9a5c6b
Adds DeepSeek V4 with hybrid CSA/HCA attention, lightning indexer, manifold-constrained hyper-connections, shared K=V MQA with grouped low-rank output, and per-head attention sink. Includes tokenizer/auto mappings, finegrained FP8 quantization support, and unit tests.
force-pushed from f2ffc23 to 26c62d0
No inheritance between HCA and CSA: each has its own cache (DynamicSlidingWindowLayer subclass) and compressor (nn.Module subclass). HCA stays minimal (non-overlapping windows, no indexer); CSA explicitly carries the overlap state + indexer. Shared math factored into module-level helpers — no coff/overlap branching, no _compress_rate_attr indirection. Also adds 'sliding_attention' to COMPRESSOR_CLASSES with None so the three attention types are dispatched explicitly in one place.
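For illustration, the explicit dispatch might be shaped like the sketch below; only the 'sliding_attention': None entry comes from the note above, the other keys and the stub classes are assumptions:

import torch.nn as nn

class DeepseekV4CSACompressor(nn.Module):  # stub: the real one carries overlap state + indexer
    pass

class DeepseekV4HCACompressor(nn.Module):  # stub: the real one pools non-overlapping windows, no indexer
    pass

COMPRESSOR_CLASSES = {
    "compressed_attention": DeepseekV4CSACompressor,        # key name assumed
    "hash_compressed_attention": DeepseekV4HCACompressor,   # key name assumed
    "sliding_attention": None,  # plain sliding attention: dispatched explicitly, no compressor
}

def build_compressor(layer_type: str):
    cls = COMPRESSOR_CLASSES[layer_type]
    return cls() if cls is not None else None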
Generation tests were assuming V4 supports advanced decoding modes (assisted generation, prompt lookup, contrastive search, static-cache compile) that the compressor's running-window cache state can't service — its buffer / pool / overlap fields aren't rewindable across drafts and aren't compatible with :class:`StaticCache`. Set the right opt-out flags so generate raises a clear error early and the corresponding tests skip cleanly:

* ``_is_stateful = True`` — gates assisted / prompt-lookup paths.
* ``_can_compile_fullgraph = False`` — gates the static-cache test (would otherwise hand the compressor a :class:`StaticSlidingWindowLayer` with no ``update_compressor`` method).
* ``_supports_flex_attn = False`` — V4 only validates eager attention; the compressor / indexer paths weren't checked under flex / SDPA / flash kernels.

Conversion mapping cleanup so save / load round-trips survive:

* Standardize on V3's ``apply_rotary_pos_emb_interleave`` for the partial-RoPE rotation, with a thin V4-side wrapper that permutes the rope channels back from the halves layout V3 leaves them in to the interleaved layout V4 was trained with — required because V4 is shared-KV (V == K rotated), so V's channel layout flows through ``wo_a`` / ``wo_b``.
* Restructure ``conversion_mapping.deepseek_v4`` into two passes: structural prefix renames first (``layers.X.attn.`` → ``model.layers.X.self_attn.``), then specific in-prefix renames on the already-prefixed HF-form keys (``...self_attn.compressor.norm.`` → ``...self_attn.compressor.kv_norm.``). A single-pass ordering loses information in either the forward or reverse direction (overlapping general / specific patterns conflict).
* Move the FP8 ``.scale`` → ``.weight_scale_inv`` rename out of the V4 static conversion list and into ``FineGrainedFP8HfQuantizer.update_weight_conversions`` so the rule is only registered when FP8 dequant is actually active. Lets ``test_reverse_loading_mapping`` skip an unrelated FP8 rule on plain saves.

Test fixes:

* Skip ``test_reverse_loading_mapping`` with a docstring spelling out why the two-pass mapping can't satisfy that test's invariant (its Pass 2 source patterns are HF-form by design; ``test_save_load`` exercises the actual round-trip).
* Skip ``test_left_padding_compatibility`` — V4's compressor pre-pools ``compress_rate``-token windows before the attention mask is applied, so left padding shifts window boundaries and folds pad tokens into pooled KV entries (same fundamental limit as RecurrentGemma).
* Add ``model.to(torch_device)`` in the ``test_hidden_states_output`` override so cuda inputs don't hit a cpu model.
* ``test_tiny_generate_runs`` now passes ``eos_token_id=-1`` so a freshly initialised random model doesn't EOS-stop before max_new_tokens, making the shape assertion deterministic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
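A rough illustration of the two-pass ordering described above; the pattern strings come from the bullets, while the function name and regex details are assumptions:

import re

def convert_deepseek_v4_keys(state_dict):
    renamed = {}
    for key, value in state_dict.items():
        # Pass 1: structural prefix renames (upstream layout -> HF layout)
        key = re.sub(r"^layers\.(\d+)\.attn\.", r"model.layers.\1.self_attn.", key)
        # Pass 2: specific in-prefix renames on the already-prefixed HF-form keys
        key = key.replace(".self_attn.compressor.norm.", ".self_attn.compressor.kv_norm.")
        renamed[key] = value
    return renamed

# "layers.0.attn.compressor.norm.weight"
#   -> "model.layers.0.self_attn.compressor.kv_norm.weight"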
vasqu
left a comment
Ok, I went into details this time. Imo the RoPE is messy atm; I'm pretty sure it can be refactored into a more normal style.
# E2M1 (FP4) value table — checkpoints sometimes ship MoE experts as packed FP4
# (two e2m1 nibbles per int8 byte), so the "weight" dtype lands as ``int8`` /
# ``float4_e2m1fn_x2`` and we have to unpack before applying the scale grid.
_FP4_E2M1_LUT = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)
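Roughly, the unpack step this table implies could look like the sketch below; the helper name and the low-nibble-first ordering are assumptions, not the PR's code:

import torch

_FP4_E2M1_LUT = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)

def unpack_fp4_e2m1(packed: torch.Tensor) -> torch.Tensor:
    # Expand two e2m1 nibbles per int8 byte into float values via the LUT.
    # Hypothetical helper: the nibble order (low nibble first) is an assumption.
    lut = torch.tensor(_FP4_E2M1_LUT, dtype=torch.float32, device=packed.device)
    as_uint8 = packed.view(torch.uint8)
    low = (as_uint8 & 0x0F).long()
    high = (as_uint8 >> 4).long()
    values = torch.stack((lut[low], lut[high]), dim=-1)  # (..., n_bytes, 2)
    return values.reshape(*packed.shape[:-1], packed.shape[-1] * 2)

# a packed byte 0x21 decodes to (0.5, 1.0) under this ordering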
Oh that's a bit awkward ngl - guess they did have to make a workaround for that. Only Blackwell has native FP4 support iirc.
FYI I had issues decompressing the model, potentially due to not being able to match to the weight_inverse conversion mappings. Still investigating.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    torch_dtype="auto",
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")

save_dir = "DeepSeek-V4-Flash-bf16"
# model.dequantize(torch.bfloat16)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
0hujun
left a comment
Reporting a bug: intermediate_size should probably be replaced with moe_intermediate_size, because in the MLP layers DeepSeek uses moe_intermediate_size. I ran the test on NPU and got an error: The size of tensor a (2048) must match the size of tensor b (18432) at non-singleton dimension 2. After using moe_intermediate_size instead, the test passes.
@kylesayrs try with the flag to prevent reverse conversion; did not have time to implement it yet, it's a bit annoying.
- apply_rotary_pos_emb takes one tensor + handles trailing-rope slicing internally;
rotate_half-style ernie pattern with repeat_interleave; rotary forward emits
half-sized cos/sin (no end-to-end duplication).
- Inherit DeepseekV4RotaryEmbedding from LagunaRotaryEmbedding (partial-rotary
compute_default_rope_parameters).
- Config:
* compress_rates dict keyed by layer type (BC kwargs for compress_rate_csa/hca).
* mlp_layer_types list (BC kwargs for num_hash_layers); MLPBlock dispatches via it.
* qk_rope_head_dim derived from partial_rotary_factor (BC kwarg accepted).
* Drop V3 inheritance + V3-only fields (kv_lora_rank, qk_nope_head_dim, v_head_dim,
n_group, topk_group, first_k_dense_replace, rope_interleave).
- Rename attention/compressor/indexer leaf weights to *_proj convention; add
conversion_mapping rules to load upstream wq_*/wkv/wgate/wo_* names.
- DeepseekV4MLP no longer inherits Qwen2MoeMLP — uses moe_intermediate_size.
- GroupedLinear forward simplified to MHA-style transpose pattern.
- Indexer / compressor: pool window views use -1 last dim (TP-friendly), softmax
in fp32, rope_layer_type as class attr.
- Drop dead self.compress_rate / self.qk_nope_head_dim assignments.
- DeepseekV4UnweightedRMSNorm: extracted weight-less RMSNorm class, used by attention's per-head Q rescale + both HC modules' input rescale.
- HyperConnection.forward returns (post, comb, collapsed) — moves the stream collapse into the mHC module instead of the DecoderLayer.
- Document the 3 in mHC scale param (pre / post / comb).
- DecoderLayer: input_ids in explicit signature (was kwargs.get).
- Comment defending the compressor mask pad against FA / SDPA backends.
- DeepseekV4Router: unified TopK + Hash routers into one class with a select_indices hook (top-k + e_score_correction_bias vs tid2eid lookup).
- Rename buffer ``bias`` → ``e_score_correction_bias`` (cross-model standard); add gate.bias → e_score_correction_bias rule in conversion_mapping.
- DeepseekV4Experts: use config.num_local_experts (routes through attribute_map) so FP8 / TP integrations stay robust.
- Drop unused self.rotary_emb_compress on the model.
- Simplify DeepseekV4ForCausalLM to a bare `pass` inheriting MixtralForCausalLM.
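A toy illustration of the layer-type-keyed config fields from the bullets above; the concrete key strings, default values, and class shape are assumptions:

class DeepseekV4ConfigSketch:
    def __init__(self, num_hidden_layers=4, head_dim=128, partial_rotary_factor=0.5,
                 compress_rates=None, mlp_layer_types=None):
        # compress_rates keyed by layer type; the BC kwargs compress_rate_csa / compress_rate_hca
        # would be folded into this dict (keys and values here are made up)
        self.compress_rates = compress_rates or {"compressed_attention": 16, "hash_compressed_attention": 32}
        # mlp_layer_types drives per-layer MLPBlock dispatch, replacing num_hash_layers
        self.mlp_layer_types = mlp_layer_types or ["dense"] + ["hash_moe"] * (num_hidden_layers - 1)
        # qk_rope_head_dim derived from partial_rotary_factor instead of being stored directly
        self.qk_rope_head_dim = int(head_dim * partial_rotary_factor)

cfg = DeepseekV4ConfigSketch()
print(cfg.qk_rope_head_dim)  # -> 64 with the made-up defaults above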
reverse_op was _IdentityOp, so saving a model that had been loaded with dequantize=True dropped the FP8 layout — saved checkpoints lost their weight_scale_inv keys and the round-trip through save_pretrained was lossy. Pair the two ops symmetrically: Fp8Dequantize.reverse_op -> Fp8Quantize and Fp8Quantize.reverse_op -> Fp8Dequantize.

Fp8Quantize.convert refactored to handle the per-expert save chain (SplitModulelist emits one key per expert -> Fp8Quantize quantizes each), and to pass non-tileable tensors through unchanged (1D norms / biases / odd 2D shapes that were never quantized on the load side).
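A hedged sketch of the symmetric pairing and the pass-through rule described above; the tile size, scale math, and method signature are assumptions, only the class names and reverse_op wiring come from the commit message:

import torch

BLOCK = 128  # tile size assumed for illustration

class Fp8Dequantize:
    """Load side (sketch): fp8 tiles * weight_scale_inv -> high-precision weight."""
    reverse_op = None

class Fp8Quantize:
    """Save side (sketch): re-derive per-tile scales; pass non-tileable tensors through."""
    reverse_op = None

    def convert(self, name, tensor):
        # 1D norms / biases and odd 2D shapes were never quantized on load: pass through.
        if tensor.dim() != 2 or tensor.shape[0] % BLOCK or tensor.shape[1] % BLOCK:
            return {name: tensor}
        tiles = tensor.reshape(tensor.shape[0] // BLOCK, BLOCK, tensor.shape[1] // BLOCK, BLOCK)
        scale = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12) / 448.0  # e4m3 max, assumed
        quantized = (tiles / scale[:, None, :, None]).reshape_as(tensor).to(torch.float8_e4m3fn)
        return {name: quantized, name.replace("weight", "weight_scale_inv"): scale}

# Symmetric pairing (the fix): each op's reverse is the other, so saving a model loaded
# with dequantize=True re-emits the FP8 layout instead of dropping it.
Fp8Dequantize.reverse_op = Fp8Quantize
Fp8Quantize.reverse_op = Fp8Dequantize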
- Drop the local rotate_half def, import from glm.modeling_glm (identical body).
- Iterate set(self.layer_types) in DeepseekV4RotaryEmbedding.__init__ for consistency with the gemma3 idiom.
- DeepseekV4MLP inherits LlamaMLP (was a hand-written nn.Module). Config attribute_map routes intermediate_size -> moe_intermediate_size and adds mlp_bias=False, so LlamaMLP's __init__ builds the right shared-expert linears without an override.
- DeepseekV4Experts inherits MixtralExperts (was GptOssExperts with an __init__ + _apply_gate override that duplicated everything). MixtralExperts' layout matches V4-Flash's; the only V4-specific bit is the swiglu_limit clamp on gate / up before SiLU, kept inline in the overridden forward.
- Split the unified DeepseekV4Router back into DeepseekV4TopKRouter and DeepseekV4HashRouter (Arthur preferred two explicit classes over a conditional select_indices hook).
- Drop **_ from DeepseekV4SparseMoeBlock.forward — the layer's caller (DeepseekV4DecoderLayer) already filters kwargs.
- DeepseekV4Model now inherits LlamaModel. super().__init__ sets up embed_tokens / norm / rotary_emb / gradient_checkpointing; we override the layer list, swap rotary_emb for the multi-layer-type V4 one, add hc_head, and keep the V4-specific forward.
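A minimal illustration of the attribute_map routing mentioned in the MLP bullet; the toy config class below is a stand-in, not the PR's configuration class:

from transformers import PretrainedConfig

class DeepseekV4ConfigToy(PretrainedConfig):
    # attribute_map lets LlamaMLP read config.intermediate_size while the stored
    # field is moe_intermediate_size
    attribute_map = {"intermediate_size": "moe_intermediate_size"}

    def __init__(self, hidden_size=64, moe_intermediate_size=128, mlp_bias=False, **kwargs):
        self.hidden_size = hidden_size
        self.moe_intermediate_size = moe_intermediate_size
        self.mlp_bias = mlp_bias
        super().__init__(**kwargs)

cfg = DeepseekV4ConfigToy()
print(cfg.intermediate_size)  # -> 128, routed through attribute_map to moe_intermediate_size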
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, deepseek_v4, finegrained_fp8
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Configuration is now hand-edited in configuration_deepseek_v4.py — modular no longer defines it, removes it from __all__, and imports it. The converter no longer regenerates the config file (no class with a Config suffix means nothing to emit there).

__post_init__ is collapsed onto five small _resolve_* methods + a single _apply_legacy_kwargs helper that strips the legacy V3-flavoured kwargs (compress_rate_csa/hca, num_hash_layers, qk_rope_head_dim, compress_ratios) into typed instance fields, so __post_init__ itself reads as a sequence of named steps.

Also expand docs/source/en/model_doc/deepseek_v4.md with an Architecture section (hybrid attention / mHC / MoE schedule / cache layers) cross-referenced to the paper sections.

Type-check fix: gate the WeightConverter.operations access in quantizer_finegrained_fp8.py with isinstance, so WeightRenaming entries pass through untouched.
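A hedged sketch of the __post_init__ shape described above; the five _resolve_* names, the field handling, and the class itself are assumptions:

class DeepseekV4ConfigSketch:
    def __init__(self, **kwargs):
        self._legacy_kwargs = dict(kwargs)
        self.__post_init__()

    def __post_init__(self):
        # reads as a sequence of named steps
        self._apply_legacy_kwargs()
        self._resolve_layer_types()
        self._resolve_compress_rates()
        self._resolve_mlp_schedule()
        self._resolve_rope_dims()
        self._resolve_moe_dims()

    def _apply_legacy_kwargs(self):
        # fold legacy V3-flavoured kwargs (compress_rate_csa/hca, num_hash_layers,
        # qk_rope_head_dim, compress_ratios) into typed instance fields
        if "qk_rope_head_dim" in self._legacy_kwargs:
            self.qk_rope_head_dim = int(self._legacy_kwargs.pop("qk_rope_head_dim"))

    def _resolve_layer_types(self): ...
    def _resolve_compress_rates(self): ...
    def _resolve_mlp_schedule(self): ...
    def _resolve_rope_dims(self): ...
    def _resolve_moe_dims(self): ...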
…near V4 is shared-KV MQA (num_kv_heads = 1). With TP, q_b_proj is colwise-sharded so the local q has num_heads / tp_size heads while kv stays replicated at one head. The eager / sdpa / flash backends all read module.num_key_value_groups to repeat kv up to q's head count — a fixed global value of num_attention_heads gives the wrong (over-)expansion factor on every rank but the first. Refresh num_key_value_groups from q.shape[1] in DeepseekV4Attention.forward, after the local q has been built, so repeat_kv(key, num_key_value_groups) lifts the single kv head to exactly the rank-local query head count.

DeepseekV4GroupedLinear was using a single bmm for the per-group projection. torchao's Float8Tensor (used by tests_tensor_parallel_ci's test_tp_generation_quantized) only fast-paths F.linear; bmm hits an mslk kernel assertion (`bmm is not supported when mslk is not installed`). Replace the bmm with a small per-group F.linear loop — slower for tiny configs, but cuts the torchao dependency and the quantized-TP path now works without mslk.
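A minimal sketch of the num_key_value_groups refresh described above; repeat_kv mirrors the standard transformers helper, while the surrounding function and shapes are illustrative assumptions:

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # transformers-style helper: (batch, num_kv_heads, seq, head_dim) -> repeat kv heads n_rep times
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)

def refresh_groups_and_repeat(module, q, key):
    # q: (batch, rank-local num_heads, seq, head_dim); kv stays replicated at one head under TP.
    # Deriving the expansion factor from the local q avoids the over-expansion a fixed
    # global num_attention_heads would give on every rank but the first.
    module.num_key_value_groups = q.shape[1] // key.shape[1]
    return repeat_kv(key, module.num_key_value_groups)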
The bmm was changed to F.linear because torchao's Float8Tensor doesn't fast-path bmm without the mslk kernel. Reverting since a custom V4 FP8 path will land later — we don't want to slow the unquantized GroupedLinear forward (~8x more ops with n_groups=8) just to avoid one CI failure on the quantized-TP test.
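An illustrative sketch of the two GroupedLinear forward strategies being weighed here; the module shape and weight layout are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedLinearSketch(nn.Module):
    """Toy grouped projection: n_groups independent (in_features -> out_features) maps."""
    def __init__(self, n_groups=8, in_features=16, out_features=32):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_groups, out_features, in_features))

    def forward_bmm(self, x):
        # x: (n_groups, tokens, in_features); one kernel launch covers all groups
        return torch.bmm(x, self.weight.transpose(1, 2))

    def forward_linear_loop(self, x):
        # per-group F.linear loop: the quantized-TP workaround; more launches, but
        # torchao's Float8Tensor only fast-paths F.linear
        return torch.stack([F.linear(x[g], self.weight[g]) for g in range(self.weight.shape[0])])

m = GroupedLinearSketch()
x = torch.randn(8, 4, 16)
assert torch.allclose(m.forward_bmm(x), m.forward_linear_loop(x), atol=1e-5)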
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45643&sha=092dcd
Draft. Supersedes #45616.