
Add DeepSeek V4 #45643

Draft

ArthurZucker wants to merge 11 commits into main from add-deepseek-v4

Conversation

@ArthurZucker
Collaborator

Draft. Supersedes #45616.

ArthurZucker mentioned this pull request Apr 25, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator Author

Outputs are valid now

penguinwu added a commit to penguinwu/oss-model-graph-break-corpus that referenced this pull request Apr 25, 2026
… Phase 1 config + runner

Three coupled changes:

1) discovery/perf.py — harden per Rocky's notes (2026-04-25) on
   pytorch/benchmarks/dynamo/common.py:
   - patch_torch_manual_seed(seed=1337) — call once at process start;
     monkey-patches torch.manual_seed so HF models' internal RNG calls
     don't drift between runs (per Animesh on HF model non-determinism).
   - eager_self_check (sketched at the end of this message) — runs forward
     twice with cloned identical inputs; reports max_abs_diff + deterministic
     bool. Detects models still non-deterministic even with the seed patch.
   - warm_peak_mem flag — captures both cold (default) and post-warmup
     peak memory. Don't conflate the two.
   - compile_times — captures torch._dynamo.utils.compile_times() dict
     (22+ metrics: _compile.compile_inner, GraphLowering.run, etc.) for
     cross-comparable compile-time analysis vs upstream HF dashboard.
   - methodology comments updated to reference common.py line numbers.

2) experiments/configs/deepseek-v4-pro-phase1.json — config for the
   Phase 1 eval. Scaled-but-architecturally-complete: ALL V4 features
   active at production dims (head_dim=512, q_lora_rank=1536,
   num_hash_layers=3, index_n_heads=64, hc_mult=4, hybrid attention, MLA, etc.);
   only num_hidden_layers (61->4), n_routed_experts (384->16), and
   vocab_size (129280->4096) scaled to fit 1x H100 in bf16. Pins the
   transformers PR branch sha (huggingface/transformers#45643 @ a0a8482).

3) experiments/scripts/run_deepseek_v4_pro_phase1.py — self-contained
   runner. Reads the config, applies seed patch + TF32 high precision,
   instantiates the model, and runs 4 dimensions in sequence:
     Step 1: instantiate + eager forward (param count, peak mem)
     Step 2: torch._dynamo.explain (graph break analysis)
     Step 3: correctness vs eager (max_abs_diff + bitwise_equal)
     Step 4: tier-1 perf via measure_perf (eager_ms / compiled_ms /
             speedup / compile_s + compile_times breakdown)
   Writes per-row results to experiments/results/deepseek_v4_pro/
   phase1-tiny-<datestamp>/results.json. Top-level torch.compile.

Phase 1 eval not yet executed — runner is ready; smoke-tested perf.py
upgrade. See experiments/deepseek_v4_pro_eval_plan.md.
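
For reference, a minimal sketch of what the eager_self_check upgrade in 1) amounts to; the function name and the reported fields come from the message above, while the exact signature and the use of .logits are assumptions:

import torch

def eager_self_check(model, inputs):
    # Run the eager forward twice on cloned identical inputs; a model that
    # drifts here is non-deterministic even with the seed patch applied.
    a = {k: v.clone() for k, v in inputs.items()}
    b = {k: v.clone() for k, v in inputs.items()}
    with torch.no_grad():
        out_a = model(**a).logits
        out_b = model(**b).logits
    max_abs_diff = (out_a - out_b).abs().max().item()
    return {"max_abs_diff": max_abs_diff, "deterministic": max_abs_diff == 0.0}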
Adds DeepSeek V4 with hybrid CSA/HCA attention, lightning indexer,
manifold-constrained hyper-connections, shared K=V MQA with grouped
low-rank output, and per-head attention sink. Includes tokenizer/auto
mappings, finegrained FP8 quantization support, and unit tests.
ArthurZucker and others added 2 commits April 28, 2026 19:27
No inheritance between HCA and CSA: each has its own cache (DynamicSlidingWindowLayer
subclass) and compressor (nn.Module subclass). HCA stays minimal (non-overlapping
windows, no indexer); CSA explicitly carries the overlap state + indexer. Shared
math factored into module-level helpers — no coff/overlap branching, no
_compress_rate_attr indirection. Also adds 'sliding_attention' to COMPRESSOR_CLASSES
with None so the three attention types are dispatched explicitly in one place.
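
A sketch of the single dispatch point this describes; COMPRESSOR_CLASSES and the 'sliding_attention' -> None entry are from the message above, while the other keys and the compressor class names are assumptions:

# Layer type -> compressor module; None means a plain sliding window, no compressor.
COMPRESSOR_CLASSES = {
    "full_attention": DeepseekV4CSACompressor,      # overlap state + indexer
    "chunked_attention": DeepseekV4HCACompressor,   # non-overlapping windows, no indexer
    "sliding_attention": None,
}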
Generation tests were assuming V4 supports advanced decoding modes (assisted
generation, prompt lookup, contrastive search, static-cache compile) that the
compressor's running-window cache state can't service — its buffer / pool /
overlap fields aren't rewindable across drafts and aren't compatible with
:class:`StaticCache`. Set the right opt-out flags so generate raises a clear
error early and the corresponding tests skip cleanly:

* ``_is_stateful = True``      — gates assisted / prompt-lookup paths.
* ``_can_compile_fullgraph = False`` — gates the static-cache test (would
  otherwise hand the compressor a :class:`StaticSlidingWindowLayer` with no
  ``update_compressor`` method).
* ``_supports_flex_attn = False`` — V4 only validates eager attention; the
  compressor / indexer paths weren't checked under flex / SDPA / flash kernels.
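
Concretely, the opt-outs sit on the pretrained-model class roughly like this (flag names are from the list above; the class name follows the usual convention):

from transformers import PreTrainedModel

class DeepseekV4PreTrainedModel(PreTrainedModel):
    _is_stateful = True              # gates assisted-generation / prompt-lookup paths
    _can_compile_fullgraph = False   # static-cache compile test skips cleanly
    _supports_flex_attn = False      # only eager attention is validated for V4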

Conversion mapping cleanup so save / load round-trips survive:

* Standardize on V3's ``apply_rotary_pos_emb_interleave`` for the partial-RoPE
  rotation, with a thin V4-side wrapper that permutes the rope channels back
  from the halves layout V3 leaves them in to the interleaved layout V4 was
  trained with — required because V4 is shared-KV (V == K rotated), so V's
  channel layout flows through ``wo_a`` / ``wo_b`` (see the permute sketch
  after this list).
* Restructure ``conversion_mapping.deepseek_v4`` into two passes: structural
  prefix renames first (``layers.X.attn.`` → ``model.layers.X.self_attn.``),
  then specific in-prefix renames on the already-prefixed HF-form keys
  (``...self_attn.compressor.norm.`` → ``...self_attn.compressor.kv_norm.``).
  A single-pass ordering loses information in either the forward or reverse
  direction (overlapping general / specific patterns conflict).
* Move the FP8 ``.scale`` → ``.weight_scale_inv`` rename out of the V4 static
  conversion list and into ``FineGrainedFP8HfQuantizer.update_weight_conversions``
  so the rule is only registered when FP8 dequant is actually active. Lets
  ``test_reverse_loading_mapping`` skip an unrelated FP8 rule on plain saves.
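
The wrapper's channel permute is a two-liner; a sketch with a hypothetical helper name, assuming the halves/interleaved layouts described in the first bullet:

import torch

def halves_to_interleaved(x: torch.Tensor) -> torch.Tensor:
    # [a0..a_{n-1}, b0..b_{n-1}] -> [a0, b0, a1, b1, ...] on the rope channels
    a, b = x.chunk(2, dim=-1)
    return torch.stack((a, b), dim=-1).flatten(-2)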

Test fixes:

* Skip ``test_reverse_loading_mapping`` with a docstring spelling out why the
  two-pass mapping can't satisfy that test's invariant (its Pass 2 source
  patterns are HF-form by design; ``test_save_load`` exercises the actual
  round-trip).
* Skip ``test_left_padding_compatibility`` — V4's compressor pre-pools
  ``compress_rate``-token windows before the attention mask is applied, so
  left padding shifts window boundaries and folds pad tokens into pooled
  KV entries (same fundamental limit as RecurrentGemma).
* Add ``model.to(torch_device)`` in the ``test_hidden_states_output`` override
  so cuda inputs don't hit a cpu model.
* ``test_tiny_generate_runs`` now passes ``eos_token_id=-1`` so a freshly
  initialised random model doesn't EOS-stop before max_new_tokens, making the
  shape assertion deterministic.
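
The last fix amounts to the following (the max_new_tokens value is illustrative):

out = model.generate(input_ids, max_new_tokens=8, eos_token_id=-1)  # no real token id is -1
assert out.shape[-1] == input_ids.shape[-1] + 8                     # shape is now deterministic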

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vasqu (Contributor) left a comment

Ok, I went into the details this time. IMO the RoPE is messy atm; I'm pretty sure it can be refactored into a more normal style.

# E2M1 (FP4) value table — checkpoints sometimes ship MoE experts as packed FP4
# (two e2m1 nibbles per int8 byte), so the "weight" dtype lands as ``int8`` /
# ``float4_e2m1fn_x2`` and we have to unpack before applying the scale grid.
_FP4_E2M1_LUT = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)
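
A sketch of the unpack step the comment alludes to, reusing the table above; the nibble order (low nibble first) is an assumption:

import torch

def unpack_fp4_e2m1(packed: torch.Tensor) -> torch.Tensor:
    # packed: int8 tensor holding two e2m1 nibbles per byte; returns floats,
    # to which the per-tile scale grid is applied afterwards.
    lut = torch.tensor(_FP4_E2M1_LUT, dtype=torch.float32, device=packed.device)
    b = packed.view(torch.uint8).to(torch.int64)
    lo, hi = b & 0xF, (b >> 4) & 0xF
    return lut[torch.stack((lo, hi), dim=-1).flatten(-2)]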

Contributor
Oh, that's a bit awkward ngl - guess they did have to make a workaround for that. Only Blackwell has native FP4 support iirc.

@kylesayrs (Contributor) commented Apr 28, 2026

FYI, I had issues decompressing the model, potentially due to not being able to match the weight_scale_inv conversion mappings. Still investigating.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    torch_dtype="auto",
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")

save_dir = "DeepSeek-V4-Flash-bf16"
#model.dequantize(torch.bfloat16)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
Analysis
  Original:
  WeightConverter(
      source_patterns="experts.*.w2.weight",
      target_patterns="experts.down_proj",
      operations=[MergeModulelist(dim=0)],
  )

  After update_weight_conversions:
  WeightConverter(
      source_patterns=["experts.*.w2.weight$", "experts.*.w2.weight_scale_inv$"],
      target_patterns="experts.down_proj",
      operations=[Fp8Dequantize, MergeModulelist(dim=0)],
  )
  
  During loading this works fine — Fp8Dequantize consumes the scale entries and drops
  them, MergeModulelist merges the dequantized weights. The model ends up with
  experts.down_proj in BF16, no scale parameter.

  During save_pretrained, revert_weight_conversion calls reverse_transform(), producing:
  WeightConverter(
      source_patterns=["experts.down_proj"],          # 1 source
      target_patterns=["experts.*.w2.weight$",        # 2 targets
                       "experts.*.w2.weight_scale_inv$"],
      operations=[SplitModulelist(dim=0), _IdentityOp()],
  )

  When SplitModulelist.convert runs at core_model_loading.py:243-251:
  - input_dict has 1 entry (just the dequantized down_proj)
  - target_patterns has 2 entries (weight + scale)
  - len(input_dict) == 1 and len(target_patterns) != 1 → line 251 raises
  ValueError("Undefined Operation encountered!")

  The gate_up_proj converter doesn't hit this same error because its reversed ops
  include Chunk(dim=1) before SplitModulelist, and Chunk expands the 1 input into
  4 entries — so SplitModulelist takes the else branch. However, it would produce
  incorrect data (chunking a 2-part tensor into 4 parts).

  Root cause: update_weight_conversions adds scale source patterns to existing
  converters, but Fp8Dequantize.reverse_op returns _IdentityOp() (a pass-through),
  so the reversed pipeline has no way to regenerate scale tensors that were
  consumed during dequantization. The target pattern count (weight + scale) no
  longer matches the input count (weight only).

@0hujun left a comment

Reporting a bug: intermediate_size should be replaced with moe_intermediate_size, because in the MLP layers DeepSeek uses moe_intermediate_size. I ran the tests on NPU and got the error "The size of tensor a (2048) must match the size of tensor b (18432) at non-singleton dimension 2". After using moe_intermediate_size instead, the tests pass.

@ArthurZucker
Collaborator Author

@kylesayrs try with the flag to prevent reverse conversion; I did not have time to implement it yet, it's a bit annoying.

- apply_rotary_pos_emb takes one tensor + handles trailing-rope slicing internally;
  rotate_half-style ernie pattern with repeat_interleave; rotary forward emits
  half-sized cos/sin (no end-to-end duplication).
- Inherit DeepseekV4RotaryEmbedding from LagunaRotaryEmbedding (partial-rotary
  compute_default_rope_parameters).
- Config:
  * compress_rates dict keyed by layer type (BC kwargs for compress_rate_csa/hca).
  * mlp_layer_types list (BC kwargs for num_hash_layers); MLPBlock dispatches via it.
  * qk_rope_head_dim derived from partial_rotary_factor (BC kwarg accepted).
  * Drop V3 inheritance + V3-only fields (kv_lora_rank, qk_nope_head_dim, v_head_dim,
    n_group, topk_group, first_k_dense_replace, rope_interleave).
- Rename attention/compressor/indexer leaf weights to *_proj convention; add
  conversion_mapping rules to load upstream wq_*/wkv/wgate/wo_* names.
- DeepseekV4MLP no longer inherits Qwen2MoeMLP — uses moe_intermediate_size.
- GroupedLinear forward simplified to MHA-style transpose pattern.
- Indexer / compressor: pool window views use -1 last dim (TP-friendly), softmax
  in fp32, rope_layer_type as class attr.
- Drop dead self.compress_rate / self.qk_nope_head_dim assignments.
- DeepseekV4UnweightedRMSNorm: extracted weight-less RMSNorm class, used by
  attention's per-head Q rescale + both HC modules' input rescale.
- HyperConnection.forward returns (post, comb, collapsed) — moves the stream
  collapse into the mHC module instead of the DecoderLayer.
- Document the 3 in mHC scale param (pre / post / comb).
- DecoderLayer: input_ids in explicit signature (was kwargs.get).
- Comment defending the compressor mask pad against FA / SDPA backends.
- DeepseekV4Router: unified TopK + Hash routers into one class with a
  select_indices hook (top-k + e_score_correction_bias vs tid2eid lookup).
- Rename buffer ``bias`` → ``e_score_correction_bias`` (cross-model standard);
  add gate.bias → e_score_correction_bias rule in conversion_mapping.
- DeepseekV4Experts: use config.num_local_experts (routes through attribute_map)
  so FP8 / TP integrations stay robust.
- Drop unused self.rotary_emb_compress on the model.
- Simplify DeepseekV4ForCausalLM to a bare `pass` inheriting MixtralForCausalLM.
reverse_op was _IdentityOp, so saving a model that had been loaded with
dequantize=True dropped the FP8 layout — saved checkpoints lost their
weight_scale_inv keys and round-trip through save_pretrained was lossy. Pair the
two ops symmetrically: Fp8Dequantize.reverse_op -> Fp8Quantize and
Fp8Quantize.reverse_op -> Fp8Dequantize.

Fp8Quantize.convert refactored to handle the per-expert save chain
(SplitModulelist emits one key per expert -> Fp8Quantize quantizes each), and to
pass non-tileable tensors through unchanged (1D norms / biases / odd 2D shapes
that were never quantized on the load side).
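
In outline (the base classes, the exact reverse_op signature, and the _quantize_tiles helper are assumptions; the symmetric pairing and the pass-through rule are what the message above describes):

class Fp8Dequantize:
    def reverse_op(self):
        return Fp8Quantize()   # was _IdentityOp(): saves regenerate weight + weight_scale_inv

class Fp8Quantize:
    def reverse_op(self):
        return Fp8Dequantize()

    def convert(self, value, block=128):
        # 1D norms / biases / odd 2D shapes were never quantized on the load
        # side, so they pass through unchanged on save.
        if value.dim() != 2 or value.shape[0] % block or value.shape[1] % block:
            return value
        return _quantize_tiles(value, block)  # hypothetical per-tile FP8 quantize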
- Drop the local rotate_half def, import from glm.modeling_glm (identical body).
- Iterate set(self.layer_types) in DeepseekV4RotaryEmbedding.__init__ for
  consistency with the gemma3 idiom.
- DeepseekV4MLP inherits LlamaMLP (was a hand-written nn.Module). Config
  attribute_map routes intermediate_size -> moe_intermediate_size (sketched
  after this list) and adds mlp_bias=False, so LlamaMLP's __init__ builds the
  right shared-expert linears without an override.
- DeepseekV4Experts inherits MixtralExperts (was GptOssExperts with an
  __init__ + _apply_gate override that duplicated everything). MixtralExperts'
  layout matches V4-Flash's; the only V4-specific bit is the swiglu_limit clamp
  on gate / up before SiLU, kept inline in the overridden forward.
- Split the unified DeepseekV4Router back into DeepseekV4TopKRouter and
  DeepseekV4HashRouter (Arthur preferred two explicit classes over a
  conditional select_indices hook).
- Drop **_ from DeepseekV4SparseMoeBlock.forward — the layer's caller
  (DeepseekV4DecoderLayer) already filters kwargs.
- DeepseekV4Model now inherits LlamaModel. super().__init__ sets up
  embed_tokens / norm / rotary_emb / gradient_checkpointing; we override the
  layer list, swap rotary_emb for the multi-layer-type V4 one, add hc_head, and
  keep the V4-specific forward.
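
The attribute_map routing mentioned above would look roughly like this; the two entries are the ones named in this thread, anything beyond them is an assumption:

from transformers import PretrainedConfig

class DeepseekV4Config(PretrainedConfig):
    # Requested attribute -> stored field: LlamaMLP reads config.intermediate_size,
    # DeepseekV4Experts reads config.num_local_experts, both land on the V4 names.
    attribute_map = {
        "intermediate_size": "moe_intermediate_size",
        "num_local_experts": "n_routed_experts",
    }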
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v4, finegrained_fp8

ArthurZucker and others added 4 commits April 29, 2026 13:52
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Configuration is now hand-edited in configuration_deepseek_v4.py — modular no
longer defines it, removes it from __all__, and imports it. The converter no
longer regenerates the config file (no class with Config suffix means nothing to
emit there).

__post_init__ is collapsed into five small _resolve_* methods + a single
_apply_legacy_kwargs helper that strips the legacy V3-flavoured kwargs
(compress_rate_csa/hca, num_hash_layers, qk_rope_head_dim, compress_ratios)
into typed instance fields, so __post_init__ itself reads as a sequence of
named steps.
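
The resulting shape, sketched (only _apply_legacy_kwargs and the _resolve_* prefix come from the message; the five suffixes are hypothetical):

def __post_init__(self):
    self._apply_legacy_kwargs()      # strips compress_rate_csa/hca, num_hash_layers,
                                     # qk_rope_head_dim, compress_ratios into typed fields
    self._resolve_layer_types()
    self._resolve_compress_rates()
    self._resolve_rope()
    self._resolve_moe_schedule()
    self._resolve_head_dims()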

Also expand docs/source/en/model_doc/deepseek_v4.md with an Architecture section
(hybrid attention / mHC / MoE schedule / cache layers) cross-referenced to the
paper sections.

Type-check fix: gate the WeightConverter.operations access in
quantizer_finegrained_fp8.py with isinstance, so WeightRenaming entries pass
through untouched.
…near

V4 is shared-KV MQA (num_kv_heads = 1). With TP, q_b_proj is colwise-sharded so
the local q has num_heads / tp_size heads while kv stays replicated at one head.
The eager / sdpa / flash backends all read module.num_key_value_groups to repeat
kv up to q's head count — a fixed global value of num_attention_heads gives the
wrong (over-)expansion factor on every rank but the first. Refresh
num_key_value_groups from q.shape[1] in DeepseekV4Attention.forward, after the
local q has been built, so repeat_kv(key, num_key_value_groups) lifts the single
kv head to exactly the rank-local query head count.
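
In code form the refresh is a one-liner at the top of the attention math; a sketch, with shapes per the message and repeat_kv as the stock helper:

# q: (batch, num_local_heads, seq, head_dim); kv stays replicated at 1 head under TP
self.num_key_value_groups = q.shape[1]             # local query heads per single kv head
key = repeat_kv(key, self.num_key_value_groups)    # lift kv to the rank-local head count
value = repeat_kv(value, self.num_key_value_groups)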

DeepseekV4GroupedLinear was using a single bmm for the per-group projection.
torchao's Float8Tensor (used by tests_tensor_parallel_ci's
test_tp_generation_quantized) only fast-paths F.linear; bmm hits an mslk kernel
assertion (`bmm is not supported when mslk is not installed`). Replace the bmm
with a small per-group F.linear loop — slower for tiny configs, but cuts the
torchao dependency and the quantized-TP path now works without mslk.
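
The replacement loop, sketched (tensor layouts are assumptions):

import torch
import torch.nn.functional as F

def grouped_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: (n_groups, tokens, in_dim), weight: (n_groups, out_dim, in_dim).
    # Equivalent to torch.bmm(x, weight.transpose(1, 2)), but stays on F.linear,
    # which torchao's Float8Tensor fast-paths.
    return torch.stack([F.linear(x[g], weight[g]) for g in range(weight.shape[0])])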
The bmm was changed to F.linear because torchao's Float8Tensor doesn't fast-path
bmm without the mslk kernel. Reverting since a custom V4 FP8 path will land
later — we don't want to slow the unquantized GroupedLinear forward (~8x more
ops with n_groups=8) just to avoid one CI failure on the quantized-TP test.
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45643&sha=092dcd
