Add DeepSeek V4 #45643
Outputs are valid now
… Phase 1 config + runner
Three coupled changes:
1) discovery/perf.py — harden per Rocky's notes (2026-04-25) on
pytorch/benchmarks/dynamo/common.py:
- patch_torch_manual_seed(seed=1337) — call once at process start;
monkey-patches torch.manual_seed so HF models' internal RNG calls
don't drift between runs (per Animesh on HF model non-determinism).
- eager_self_check — runs forward twice with cloned identical inputs;
reports max_abs_diff + deterministic bool. Detects models still
non-deterministic even with the seed patch.
- warm_peak_mem flag — captures both cold (default) and post-warmup
peak memory. Don't conflate the two.
- compile_times — captures torch._dynamo.utils.compile_times() dict
(22+ metrics: _compile.compile_inner, GraphLowering.run, etc.) for
cross-comparable compile-time analysis vs upstream HF dashboard.
- methodology comments updated to reference common.py line numbers.
2) experiments/configs/deepseek-v4-pro-phase1.json — config for the
Phase 1 eval. Scaled-but-architecturally-complete: ALL V4 features
active at production dims (head_dim=512, q_lora_rank=1536,
num_hash_layers=3, index_n_heads=64, hc_mult=4, hybrid attention, MLA,
etc.); only num_hidden_layers (61->4), n_routed_experts (384->16), and
vocab_size (129280->4096) scaled to fit 1x H100 in bf16. Pins the
transformers PR branch sha (huggingface/transformers#45643 @ a0a8482).
3) experiments/scripts/run_deepseek_v4_pro_phase1.py — self-contained
runner. Reads the config, applies seed patch + TF32 high precision,
instantiates the model, and runs 4 dimensions in sequence:
Step 1: instantiate + eager forward (param count, peak mem)
Step 2: torch._dynamo.explain (graph break analysis)
Step 3: correctness vs eager (max_abs_diff + bitwise_equal)
Step 4: tier-1 perf via measure_perf (eager_ms / compiled_ms /
speedup / compile_s + compile_times breakdown)
Writes per-row results to experiments/results/deepseek_v4_pro/
phase1-tiny-<datestamp>/results.json. Top-level torch.compile.
Phase 1 eval not yet executed — runner is ready; smoke-tested perf.py
upgrade. See experiments/deepseek_v4_pro_eval_plan.md.
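For reference, a rough sketch of what the eager self-check described in (1) could look like; the signature and return shape here are assumptions, not the actual perf.py API:

import torch

def eager_self_check(model, inputs, atol=0.0):
    # Hypothetical helper mirroring the description above: run the eager forward twice
    # on cloned, identical inputs and report whether the model is deterministic
    # (catches models that still drift even after the manual-seed patch).
    model.eval()
    with torch.no_grad():
        out1 = model(**{k: v.clone() for k, v in inputs.items()}).logits
        out2 = model(**{k: v.clone() for k, v in inputs.items()}).logits
    max_abs_diff = (out1 - out2).abs().max().item()
    return {"max_abs_diff": max_abs_diff, "deterministic": max_abs_diff <= atol}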
force-pushed from a79ed83 to b9a5c6b
Adds DeepSeek V4 with hybrid CSA/HCA attention, lightning indexer, manifold-constrained hyper-connections, shared K=V MQA with grouped low-rank output, and per-head attention sink. Includes tokenizer/auto mappings, finegrained FP8 quantization support, and unit tests.
force-pushed from f2ffc23 to 26c62d0
No inheritance between HCA and CSA: each has its own cache (DynamicSlidingWindowLayer subclass) and compressor (nn.Module subclass). HCA stays minimal (non-overlapping windows, no indexer); CSA explicitly carries the overlap state + indexer. Shared math factored into module-level helpers — no coff/overlap branching, no _compress_rate_attr indirection. Also adds 'sliding_attention' to COMPRESSOR_CLASSES with None so the three attention types are dispatched explicitly in one place.
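For illustration, the explicit dispatch might be shaped like the sketch below; only the 'sliding_attention': None entry comes from the note above, the other keys and the stub classes are assumptions:

import torch.nn as nn

class DeepseekV4CSACompressor(nn.Module):  # stub: the real one carries overlap state + indexer
    pass

class DeepseekV4HCACompressor(nn.Module):  # stub: the real one pools non-overlapping windows, no indexer
    pass

COMPRESSOR_CLASSES = {
    "compressed_attention": DeepseekV4CSACompressor,        # key name assumed
    "hash_compressed_attention": DeepseekV4HCACompressor,   # key name assumed
    "sliding_attention": None,  # plain sliding attention: dispatched explicitly, no compressor
}

def build_compressor(layer_type: str):
    cls = COMPRESSOR_CLASSES[layer_type]
    return cls() if cls is not None else None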
Generation tests were assuming V4 supports advanced decoding modes (assisted generation, prompt lookup, contrastive search, static-cache compile) that the compressor's running-window cache state can't service — its buffer / pool / overlap fields aren't rewindable across drafts and aren't compatible with :class:`StaticCache`. Set the right opt-out flags so generate raises a clear error early and the corresponding tests skip cleanly:

* ``_is_stateful = True`` — gates assisted / prompt-lookup paths.
* ``_can_compile_fullgraph = False`` — gates the static-cache test (would otherwise hand the compressor a :class:`StaticSlidingWindowLayer` with no ``update_compressor`` method).
* ``_supports_flex_attn = False`` — V4 only validates eager attention; the compressor / indexer paths weren't checked under flex / SDPA / flash kernels.

Conversion mapping cleanup so save / load round-trips survive:

* Standardize on V3's ``apply_rotary_pos_emb_interleave`` for the partial-RoPE rotation, with a thin V4-side wrapper that permutes the rope channels back from the halves layout V3 leaves them in to the interleaved layout V4 was trained with — required because V4 is shared-KV (V == K rotated), so V's channel layout flows through ``wo_a`` / ``wo_b``.
* Restructure ``conversion_mapping.deepseek_v4`` into two passes: structural prefix renames first (``layers.X.attn.`` → ``model.layers.X.self_attn.``), then specific in-prefix renames on the already-prefixed HF-form keys (``...self_attn.compressor.norm.`` → ``...self_attn.compressor.kv_norm.``). A single-pass ordering loses information in either the forward or reverse direction (overlapping general / specific patterns conflict).
* Move the FP8 ``.scale`` → ``.weight_scale_inv`` rename out of the V4 static conversion list and into ``FineGrainedFP8HfQuantizer.update_weight_conversions`` so the rule is only registered when FP8 dequant is actually active. Lets ``test_reverse_loading_mapping`` skip an unrelated FP8 rule on plain saves.

Test fixes:

* Skip ``test_reverse_loading_mapping`` with a docstring spelling out why the two-pass mapping can't satisfy that test's invariant (its Pass 2 source patterns are HF-form by design; ``test_save_load`` exercises the actual round-trip).
* Skip ``test_left_padding_compatibility`` — V4's compressor pre-pools ``compress_rate``-token windows before the attention mask is applied, so left padding shifts window boundaries and folds pad tokens into pooled KV entries (same fundamental limit as RecurrentGemma).
* Add ``model.to(torch_device)`` in the ``test_hidden_states_output`` override so cuda inputs don't hit a cpu model.
* ``test_tiny_generate_runs`` now passes ``eos_token_id=-1`` so a freshly initialised random model doesn't EOS-stop before max_new_tokens, making the shape assertion deterministic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
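A rough illustration of the two-pass ordering described above; the pattern strings come from the bullets, while the function name and regex details are assumptions:

import re

def convert_deepseek_v4_keys(state_dict):
    renamed = {}
    for key, value in state_dict.items():
        # Pass 1: structural prefix renames (upstream layout -> HF layout)
        key = re.sub(r"^layers\.(\d+)\.attn\.", r"model.layers.\1.self_attn.", key)
        # Pass 2: specific in-prefix renames on the already-prefixed HF-form keys
        key = key.replace(".self_attn.compressor.norm.", ".self_attn.compressor.kv_norm.")
        renamed[key] = value
    return renamed

# "layers.0.attn.compressor.norm.weight"
#   -> "model.layers.0.self_attn.compressor.kv_norm.weight"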
vasqu
left a comment
Ok, I went into details this time. Imo the RoPE is messy atm; I'm pretty sure it can be refactored into a more normal style.
# E2M1 (FP4) value table — checkpoints sometimes ship MoE experts as packed FP4
# (two e2m1 nibbles per int8 byte), so the "weight" dtype lands as ``int8`` /
# ``float4_e2m1fn_x2`` and we have to unpack before applying the scale grid.
_FP4_E2M1_LUT = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)
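Roughly, the unpack step this table implies could look like the sketch below; the helper name and the low-nibble-first ordering are assumptions, not the PR's code:

import torch

_FP4_E2M1_LUT = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)

def unpack_fp4_e2m1(packed: torch.Tensor) -> torch.Tensor:
    # Expand two e2m1 nibbles per int8 byte into float values via the LUT.
    # Hypothetical helper: the nibble order (low nibble first) is an assumption.
    lut = torch.tensor(_FP4_E2M1_LUT, dtype=torch.float32, device=packed.device)
    as_uint8 = packed.view(torch.uint8)
    low = (as_uint8 & 0x0F).long()
    high = (as_uint8 >> 4).long()
    values = torch.stack((lut[low], lut[high]), dim=-1)  # (..., n_bytes, 2)
    return values.reshape(*packed.shape[:-1], packed.shape[-1] * 2)

# a packed byte 0x21 decodes to (0.5, 1.0) under this ordering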
Oh that's a bit awkward ngl - guess they did have to make a workaround for that. Only Blackwell has native FP4 support iirc.
FYI I had issues decompressing the model, potentially due to not being able to match to the weight_inverse conversion mappings. Still investigating.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    torch_dtype="auto",
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")

save_dir = "DeepSeek-V4-Flash-bf16"
# model.dequantize(torch.bfloat16)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
0hujun
left a comment
Reporting a bug: intermediate_size should probably be replaced with moe_intermediate_size, because in the MLP layers DeepSeek uses moe_intermediate_size. I ran the test on NPU and got an error: The size of tensor a (2048) must match the size of tensor b (18432) at non-singleton dimension 2. After using moe_intermediate_size instead, the test passes.
@kylesayrs try with the flag to prevent reverse conversion; did not have time to implement it yet, it's a bit annoying.
- apply_rotary_pos_emb takes one tensor + handles trailing-rope slicing internally;
rotate_half-style ernie pattern with repeat_interleave; rotary forward emits
half-sized cos/sin (no end-to-end duplication).
- Inherit DeepseekV4RotaryEmbedding from LagunaRotaryEmbedding (partial-rotary
compute_default_rope_parameters).
- Config:
* compress_rates dict keyed by layer type (BC kwargs for compress_rate_csa/hca).
* mlp_layer_types list (BC kwargs for num_hash_layers); MLPBlock dispatches via it.
* qk_rope_head_dim derived from partial_rotary_factor (BC kwarg accepted).
* Drop V3 inheritance + V3-only fields (kv_lora_rank, qk_nope_head_dim, v_head_dim,
n_group, topk_group, first_k_dense_replace, rope_interleave).
- Rename attention/compressor/indexer leaf weights to *_proj convention; add
conversion_mapping rules to load upstream wq_*/wkv/wgate/wo_* names.
- DeepseekV4MLP no longer inherits Qwen2MoeMLP — uses moe_intermediate_size.
- GroupedLinear forward simplified to MHA-style transpose pattern.
- Indexer / compressor: pool window views use -1 last dim (TP-friendly), softmax
in fp32, rope_layer_type as class attr.
- Drop dead self.compress_rate / self.qk_nope_head_dim assignments.
- DeepseekV4UnweightedRMSNorm: extracted weight-less RMSNorm class, used by attention's per-head Q rescale + both HC modules' input rescale.
- HyperConnection.forward returns (post, comb, collapsed) — moves the stream collapse into the mHC module instead of the DecoderLayer.
- Document the 3 in mHC scale param (pre / post / comb).
- DecoderLayer: input_ids in explicit signature (was kwargs.get).
- Comment defending the compressor mask pad against FA / SDPA backends.
- DeepseekV4Router: unified TopK + Hash routers into one class with a select_indices hook (top-k + e_score_correction_bias vs tid2eid lookup).
- Rename buffer ``bias`` → ``e_score_correction_bias`` (cross-model standard); add gate.bias → e_score_correction_bias rule in conversion_mapping.
- DeepseekV4Experts: use config.num_local_experts (routes through attribute_map) so FP8 / TP integrations stay robust.
- Drop unused self.rotary_emb_compress on the model.
- Simplify DeepseekV4ForCausalLM to a bare `pass` inheriting MixtralForCausalLM.
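A toy illustration of the layer-type-keyed config fields from the bullets above; the concrete key strings, default values, and class shape are assumptions:

class DeepseekV4ConfigSketch:
    def __init__(self, num_hidden_layers=4, head_dim=128, partial_rotary_factor=0.5,
                 compress_rates=None, mlp_layer_types=None):
        # compress_rates keyed by layer type; the BC kwargs compress_rate_csa / compress_rate_hca
        # would be folded into this dict (keys and values here are made up)
        self.compress_rates = compress_rates or {"compressed_attention": 16, "hash_compressed_attention": 32}
        # mlp_layer_types drives per-layer MLPBlock dispatch, replacing num_hash_layers
        self.mlp_layer_types = mlp_layer_types or ["dense"] + ["hash_moe"] * (num_hidden_layers - 1)
        # qk_rope_head_dim derived from partial_rotary_factor instead of being stored directly
        self.qk_rope_head_dim = int(head_dim * partial_rotary_factor)

cfg = DeepseekV4ConfigSketch()
print(cfg.qk_rope_head_dim)  # -> 64 with the made-up defaults above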
reverse_op was _IdentityOp, so saving a model that had been loaded with dequantize=True dropped the FP8 layout — saved checkpoints lost their weight_scale_inv keys and the round-trip through save_pretrained was lossy. Pair the two ops symmetrically: Fp8Dequantize.reverse_op -> Fp8Quantize and Fp8Quantize.reverse_op -> Fp8Dequantize.

Fp8Quantize.convert refactored to handle the per-expert save chain (SplitModulelist emits one key per expert -> Fp8Quantize quantizes each), and to pass non-tileable tensors through unchanged (1D norms / biases / odd 2D shapes that were never quantized on the load side).
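A hedged sketch of the symmetric pairing and the pass-through rule described above; the tile size, scale math, and method signature are assumptions, only the class names and reverse_op wiring come from the commit message:

import torch

BLOCK = 128  # tile size assumed for illustration

class Fp8Dequantize:
    """Load side (sketch): fp8 tiles * weight_scale_inv -> high-precision weight."""
    reverse_op = None

class Fp8Quantize:
    """Save side (sketch): re-derive per-tile scales; pass non-tileable tensors through."""
    reverse_op = None

    def convert(self, name, tensor):
        # 1D norms / biases and odd 2D shapes were never quantized on load: pass through.
        if tensor.dim() != 2 or tensor.shape[0] % BLOCK or tensor.shape[1] % BLOCK:
            return {name: tensor}
        tiles = tensor.reshape(tensor.shape[0] // BLOCK, BLOCK, tensor.shape[1] // BLOCK, BLOCK)
        scale = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12) / 448.0  # e4m3 max, assumed
        quantized = (tiles / scale[:, None, :, None]).reshape_as(tensor).to(torch.float8_e4m3fn)
        return {name: quantized, name.replace("weight", "weight_scale_inv"): scale}

# Symmetric pairing (the fix): each op's reverse is the other, so saving a model loaded
# with dequantize=True re-emits the FP8 layout instead of dropping it.
Fp8Dequantize.reverse_op = Fp8Quantize
Fp8Quantize.reverse_op = Fp8Dequantize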
- Drop the local rotate_half def, import from glm.modeling_glm (identical body).
- Iterate set(self.layer_types) in DeepseekV4RotaryEmbedding.__init__ for consistency with the gemma3 idiom.
- DeepseekV4MLP inherits LlamaMLP (was a hand-written nn.Module). Config attribute_map routes intermediate_size -> moe_intermediate_size and adds mlp_bias=False, so LlamaMLP's __init__ builds the right shared-expert linears without an override.
- DeepseekV4Experts inherits MixtralExperts (was GptOssExperts with an __init__ + _apply_gate override that duplicated everything). MixtralExperts' layout matches V4-Flash's; the only V4-specific bit is the swiglu_limit clamp on gate / up before SiLU, kept inline in the overridden forward.
- Split the unified DeepseekV4Router back into DeepseekV4TopKRouter and DeepseekV4HashRouter (Arthur preferred two explicit classes over a conditional select_indices hook).
- Drop **_ from DeepseekV4SparseMoeBlock.forward — the layer's caller (DeepseekV4DecoderLayer) already filters kwargs.
- DeepseekV4Model now inherits LlamaModel. super().__init__ sets up embed_tokens / norm / rotary_emb / gradient_checkpointing; we override the layer list, swap rotary_emb for the multi-layer-type V4 one, add hc_head, and keep the V4-specific forward.
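A minimal illustration of the attribute_map routing mentioned in the MLP bullet; the toy config class below is a stand-in, not the PR's configuration class:

from transformers import PretrainedConfig

class DeepseekV4ConfigToy(PretrainedConfig):
    # attribute_map lets LlamaMLP read config.intermediate_size while the stored
    # field is moe_intermediate_size
    attribute_map = {"intermediate_size": "moe_intermediate_size"}

    def __init__(self, hidden_size=64, moe_intermediate_size=128, mlp_bias=False, **kwargs):
        self.hidden_size = hidden_size
        self.moe_intermediate_size = moe_intermediate_size
        self.mlp_bias = mlp_bias
        super().__init__(**kwargs)

cfg = DeepseekV4ConfigToy()
print(cfg.intermediate_size)  # -> 128, routed through attribute_map to moe_intermediate_size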
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, deepseek_v4, finegrained_fp8
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Configuration is now hand-edited in configuration_deepseek_v4.py — modular no longer defines it, removes it from __all__, and imports it. The converter no longer regenerates the config file (no class with a Config suffix means nothing to emit there).

__post_init__ is collapsed onto five small _resolve_* methods + a single _apply_legacy_kwargs helper that strips the legacy V3-flavoured kwargs (compress_rate_csa/hca, num_hash_layers, qk_rope_head_dim, compress_ratios) into typed instance fields, so __post_init__ itself reads as a sequence of named steps.

Also expand docs/source/en/model_doc/deepseek_v4.md with an Architecture section (hybrid attention / mHC / MoE schedule / cache layers) cross-referenced to the paper sections.

Type-check fix: gate the WeightConverter.operations access in quantizer_finegrained_fp8.py with isinstance, so WeightRenaming entries pass through untouched.
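A hedged sketch of the __post_init__ shape described above; the five _resolve_* names, the field handling, and the class itself are assumptions:

class DeepseekV4ConfigSketch:
    def __init__(self, **kwargs):
        self._legacy_kwargs = dict(kwargs)
        self.__post_init__()

    def __post_init__(self):
        # reads as a sequence of named steps
        self._apply_legacy_kwargs()
        self._resolve_layer_types()
        self._resolve_compress_rates()
        self._resolve_mlp_schedule()
        self._resolve_rope_dims()
        self._resolve_moe_dims()

    def _apply_legacy_kwargs(self):
        # fold legacy V3-flavoured kwargs (compress_rate_csa/hca, num_hash_layers,
        # qk_rope_head_dim, compress_ratios) into typed instance fields
        if "qk_rope_head_dim" in self._legacy_kwargs:
            self.qk_rope_head_dim = int(self._legacy_kwargs.pop("qk_rope_head_dim"))

    def _resolve_layer_types(self): ...
    def _resolve_compress_rates(self): ...
    def _resolve_mlp_schedule(self): ...
    def _resolve_rope_dims(self): ...
    def _resolve_moe_dims(self): ...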
…near V4 is shared-KV MQA (num_kv_heads = 1). With TP, q_b_proj is colwise-sharded so the local q has num_heads / tp_size heads while kv stays replicated at one head. The eager / sdpa / flash backends all read module.num_key_value_groups to repeat kv up to q's head count — a fixed global value of num_attention_heads gives the wrong (over-)expansion factor on every rank but the first. Refresh num_key_value_groups from q.shape[1] in DeepseekV4Attention.forward, after the local q has been built, so repeat_kv(key, num_key_value_groups) lifts the single kv head to exactly the rank-local query head count.

DeepseekV4GroupedLinear was using a single bmm for the per-group projection. torchao's Float8Tensor (used by tests_tensor_parallel_ci's test_tp_generation_quantized) only fast-paths F.linear; bmm hits an mslk kernel assertion (`bmm is not supported when mslk is not installed`). Replace the bmm with a small per-group F.linear loop — slower for tiny configs, but cuts the torchao dependency and the quantized-TP path now works without mslk.
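A minimal sketch of the num_key_value_groups refresh described above; repeat_kv mirrors the standard transformers helper, while the surrounding function and shapes are illustrative assumptions:

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # transformers-style helper: (batch, num_kv_heads, seq, head_dim) -> repeat kv heads n_rep times
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)

def refresh_groups_and_repeat(module, q, key):
    # q: (batch, rank-local num_heads, seq, head_dim); kv stays replicated at one head under TP.
    # Deriving the expansion factor from the local q avoids the over-expansion a fixed
    # global num_attention_heads would give on every rank but the first.
    module.num_key_value_groups = q.shape[1] // key.shape[1]
    return repeat_kv(key, module.num_key_value_groups)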
The bmm was changed to F.linear because torchao's Float8Tensor doesn't fast-path bmm without the mslk kernel. Reverting since a custom V4 FP8 path will land later — we don't want to slow the unquantized GroupedLinear forward (~8x more ops with n_groups=8) just to avoid one CI failure on the quantized-TP test.
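An illustrative sketch of the two GroupedLinear forward strategies being weighed here; the module shape and weight layout are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedLinearSketch(nn.Module):
    """Toy grouped projection: n_groups independent (in_features -> out_features) maps."""
    def __init__(self, n_groups=8, in_features=16, out_features=32):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_groups, out_features, in_features))

    def forward_bmm(self, x):
        # x: (n_groups, tokens, in_features); one kernel launch covers all groups
        return torch.bmm(x, self.weight.transpose(1, 2))

    def forward_linear_loop(self, x):
        # per-group F.linear loop: the quantized-TP workaround; more launches, but
        # torchao's Float8Tensor only fast-paths F.linear
        return torch.stack([F.linear(x[g], self.weight[g]) for g in range(self.weight.shape[0])])

m = GroupedLinearSketch()
x = torch.randn(8, 4, 16)
assert torch.allclose(m.forward_bmm(x), m.forward_linear_loop(x), atol=1e-5)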
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45643&sha=092dcd
Draft. Supersedes #45616.