docs(llm): add DeepSeek V4 Flash fine-tuning guide#2053

Merged
HuiyingLi merged 1 commit into NVIDIA-NeMo:main from khazic:docs/dsv4-flash-guide
Apr 25, 2026

Conversation

@khazic (Contributor) commented Apr 25, 2026

Summary

Adds a Qwen3.5-style fine-tuning guide for DeepSeek V4 Flash, mirroring the format of docs/guides/vlm/qwen3-5.md. Companion to #2039 (the model + recipe PR).

What's added

  • docs/guides/llm/dsv4-flash.md (new) — covers:

    • Architecture: SWA / CSA / HCA hybrid attention via compress_ratios, Hash gate (DeepseekV4HashGate + tid2eid) on the first num_hash_layers layers, Hyper-Connections with col-norm-first Sinkhorn, dual-base RoPE (θ=10000 / θ=160000 + YaRN), and GQA + Q-LoRA + grouped O-LoRA (see the Sinkhorn and dual-base RoPE sketches after this list).
    • Checkpoint format: FP4 e2m1fn packed routed experts with FP8 e8m0fnu scales, FP8 e4m3fn with 128×128 block scaling for the rest, hash-bias drop, Indexer / Compressor key flattening, and an F8_E8M0 / F8_E5M2 storage-reader backport (see the FP4 packing sketch after this list).
    • Both shipped recipes — deepseek_v4_flash_validate.yaml (4-layer infra harness) and deepseek_v4_flash_hellaswag.yaml (HellaSwag finetune).
    • Standalone Slurm launch script.
    • Layer-parity result vs the DeepSeek inference reference: final-logits cosine similarity 0.998, top-1 token match, and every block at cosine ≥ 0.987 on the 4-layer parity harness (see the parity-check sketch after this list).
    • 43-layer full-finetune loss curve.
  • docs/index.md — adds a "Fine-tune DeepSeek V4 Flash" row to the feature table and a toctree entry under "Recipes & E2E Examples".

  • docs/model-coverage/latest-models.md — adds DeepSeek V4 Flash entry at the top with date 2026-04-25.
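
For the Hyper-Connections item above, here is a minimal sketch of what a col-norm-first Sinkhorn normalization looks like. Only the column-before-row ordering is taken from the guide; the iteration itself is the generic Sinkhorn-Knopp projection, and the function name is made up for illustration.

```python
import torch

def sinkhorn_col_first(logits: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Generic Sinkhorn-Knopp sketch, normalizing columns before rows
    (the 'col-norm-first' order named in the guide). Hypothetical helper,
    not the PR's actual kernel."""
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=0, keepdim=True)  # columns first
        m = m / m.sum(dim=1, keepdim=True)  # then rows
    return m
```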
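The dual-base RoPE can be pictured as two ordinary rotary caches built from the two θ values named above. The sketch below assumes a standard interleaved-pair rotary layout; which attention path consumes which base, and the YaRN rescaling on top, are not shown, and every helper name here is hypothetical.

```python
import torch

def rope_cache(head_dim: int, max_pos: int, theta: float):
    """Precompute cos/sin tables for one RoPE base (hypothetical helper)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_pos).float(), inv_freq)
    return torch.cos(angles), torch.sin(angles)      # each [max_pos, head_dim/2]

def apply_rope(x, cos, sin):
    """Rotate interleaved channel pairs; x is [batch, seq, heads, head_dim]."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c = cos[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # [1, seq, 1, head_dim/2]
    s = sin[: x.shape[1]].unsqueeze(0).unsqueeze(2)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out

# Two caches, one per base; pairing them with the SWA vs. long-range paths
# below is an assumption on my part, not something this PR states.
cos_lo, sin_lo = rope_cache(128, 4096, theta=10_000.0)
cos_hi, sin_hi = rope_cache(128, 4096, theta=160_000.0)
q = torch.randn(1, 16, 8, 128)
q_local = apply_rope(q, cos_lo, sin_lo)    # e.g. sliding-window layers
q_global = apply_rope(q, cos_hi, sin_hi)   # e.g. compressed / long-range layers
```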
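To make the FP4 e2m1fn + e8m0 scale layout concrete, here is a rough pack/dequantize sketch. The 16-entry E2M1 codebook is the standard one; the block size, rounding rule, and packing order are assumptions (real e8m0fnu is a biased 8-bit exponent byte, which this sketch keeps as a plain integer exponent), and the storage reader in the PR is authoritative.

```python
import torch

# The 16 values representable in FP4 E2M1 (sign bit + 2 exponent + 1 mantissa).
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

def quantize_e2m1_blocks(w: torch.Tensor, block: int = 32):
    """Quantize a 1-D tensor (numel divisible by `block`) to E2M1 codes with
    one power-of-two scale per block, packing two 4-bit codes per byte.
    Block size and layout are assumptions, not the checkpoint's spec."""
    w = w.reshape(-1, block)
    # Pick the power-of-two scale that maps each block's max onto E2M1's
    # largest magnitude (6.0).
    amax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    exp = torch.ceil(torch.log2(amax / 6.0))
    scale = torch.pow(2.0, exp)
    # Nearest-value rounding into the 16-entry codebook.
    dist = (w / scale).unsqueeze(-1) - E2M1_VALUES     # [blocks, block, 16]
    codes = dist.abs().argmin(dim=-1).to(torch.uint8)  # 4-bit code per weight
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)    # two codes per byte
    return packed, exp.squeeze(1)

def dequantize_e2m1_blocks(packed: torch.Tensor, exp: torch.Tensor, block: int = 32):
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], block)
    return E2M1_VALUES[codes.long()] * torch.pow(2.0, exp).unsqueeze(1)
```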
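For the parity numbers above, a generic final-logits check between two implementations can be written as below; the function name, vocab size, and the perturbation in the usage lines are illustrative only (the real harness lives in the guide).

```python
import torch
import torch.nn.functional as F

def parity_report(ref_logits: torch.Tensor, test_logits: torch.Tensor):
    """Compare last-position logits from two implementations of one model.
    ref_logits / test_logits: [batch, vocab]. Hypothetical helper."""
    cos = F.cosine_similarity(ref_logits, test_logits, dim=-1)
    top1 = ref_logits.argmax(dim=-1) == test_logits.argmax(dim=-1)
    return cos.min().item(), bool(top1.all())

# Usage: a near-identical pair should report cosine ~1.0 and full top-1 match,
# mirroring the shape of the numbers reported above (0.998 / top-1 match).
ref = torch.randn(4, 32000)
cos_min, all_match = parity_report(ref, ref + 1e-3 * torch.randn_like(ref))
print(f"min final-logits cosine: {cos_min:.4f}, top-1 match: {all_match}")
```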

Test plan

  • Markdown renders locally.
  • All cross-links resolve to existing files / sections.
  • No CLAUDE co-author tag; commit is signed off.
  • CI (docs build + linting).

Adds a Qwen3.5-style fine-tuning guide for DeepSeek V4 Flash, covering
the architecture (SWA / CSA / HCA hybrid attention via compress_ratios,
Hash gate, Hyper-Connections, dual-base RoPE, Q-LoRA + grouped O-LoRA),
checkpoint format support (FP4 e2m1fn + FP8 e8m0fnu / e4m3fn / e5m2),
both shipped recipes (validate harness + HellaSwag), and the 4-layer
parity result vs the DeepSeek inference reference.

- docs/guides/llm/dsv4-flash.md  (new)
- docs/index.md                  (feature table + toctree entry)
- docs/model-coverage/latest-models.md  (entry under 2026-04-25)

Signed-off-by: khazic <khazzz1c@gmail.com>
@copy-pr-bot (Bot) commented Apr 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi (Contributor) commented

/ok to test 983d29e

@HuiyingLi enabled auto-merge (squash) April 25, 2026 08:36
@HuiyingLi merged commit 52fd3bb into NVIDIA-NeMo:main Apr 25, 2026
30 of 31 checks passed
@khazic (Contributor, Author) commented Apr 25, 2026

Pushed 0ca1d03 to drop the deepseek_v4_flash_validate.yaml reference from the guide (the bullet under "Launch Training" plus the "Quick infrastructure validation" subsection). The validate harness is an internal smoke-test config — the user-facing guide should advertise only the HellaSwag recipe. Ready for re-review.

HuiyingLi added a commit that referenced this pull request Apr 26, 2026
…2054)

* docs(llm): drop validate-yaml reference from DeepSeek V4 Flash guide

Removes the validate-yaml bullet under "Launch Training" and the
"Quick infrastructure validation" subsection.  The validate harness
is an internal smoke-test config, not a user-facing finetune recipe;
the guide should advertise only the HellaSwag recipe.

Follow-up to #2053 (the original change was force-pushed after the
PR had already merged, so the deletion did not land on main).

Signed-off-by: khazic <khazzz1c@gmail.com>

* docs(llm): add DeepSeek V4 Flash to README + model-coverage index

Mirrors the per-model rollout pattern used for MiniMax-M2.7 (#1785):
news entry at the top of the README, a dedicated model-coverage page
under deepseek-ai/, and registration of the new page in the LLM index
(architecture table + toctree).

- README.md                                            (news entry)
- docs/model-coverage/llm/deepseek-ai/dsv4-flash.md    (new)
- docs/model-coverage/llm/index.md                     (table + toctree)

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(llm): use plain link for hellaswag yaml until model PR lands

The {download} directive on the recipe yaml fails the Sphinx build
with `download.not_readable` because
examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml
is added by the model PR (#2039), which has not yet landed on main.
Use a plain GitHub link until #2039 merges; a follow-up can switch
back to {download} once the file is on main.

Signed-off-by: khazic <khazzz1c@gmail.com>

---------

Signed-off-by: khazic <khazzz1c@gmail.com>
Co-authored-by: Huiying Li <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>