checkpoint utility: optimize to_maxtext, add deepseek#3184
copybara-service[bot] merged 1 commit into main.
Conversation
This Pull Request introduces significant improvements and optimizations to the checkpoint conversion process in MaxText, specifically focusing on DeepSeek model support (V2-16B, V3-671B, and V3.2-671B). The implementation of LazyHFLoader and the adoption of dtype="auto" in Hugging Face loading are excellent additions that substantially reduce memory overhead, making the conversion of extremely large models more feasible.
🔍 General Feedback
- Efficiency: The shift towards memory-efficient loading strategies is a major highlight. Using `safetensors` on demand avoids redundant memory consumption during the `to_maxtext` conversion.
- Support: Comprehensive support for DeepSeek's MLA architecture and MoE experts is well integrated into both `hf_shape.py` and `param_mapping.py`.
- Maintainability: The refactoring of `forward_pass_logit_checker.py` and the grouping of reshape hooks in `param_mapping.py` significantly improve code clarity and ease of future extension.
RissyRan
left a comment
LGTM at high level! A few minor comments.
RissyRan
left a comment
Thanks for the change! Just a minor comment to make logging more useful.
Description
- Optimize `to_maxtext` loading and saving: optimize the `to_maxtext` eager loading and saving pipelines. By controlling the data type footprint, we reduce memory use and deliver a speedup.
- Add DeepSeek support (`deepseek3.2-671b`, `deepseek3-671b`, `deepseek2-16b`).

Problem
Previously, eager load defaulted to `transformers_class.from_pretrained(...)`, which loaded, converted, and saved checkpoints in `float32`.
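To make the cost of that default concrete: at 4 bytes per parameter, a `float32` checkpoint carries twice the raw tensor bytes of a `bfloat16` one. A back-of-the-envelope sketch (the parameter count here is illustrative, not an exact figure for any model):

```python
def raw_weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Raw tensor footprint of a checkpoint, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Roughly 120e9 parameters; float32 is 4 bytes/param, bfloat16 is 2.
params = 120_000_000_000
fp32_gib = raw_weight_gib(params, 4)  # ~447 GiB
bf16_gib = raw_weight_gib(params, 2)  # half the footprint
```

This halving is visible in the measured numbers below (e.g., peak memory and on-disk size roughly halve for `gpt-oss-120b`).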
What Changed

This PR introduces two optimized eager loading methods and adds the ability to save in `bfloat16`:
- Method 1: `transformers_class.from_pretrained(..., dtype="auto")` to load in the original tensor type.
- Method 2: `safetensors.safe_open(..., framework="pt")` to load natively from safetensors. Like Method 1, this can process either a remote repo or a local path.
- Save: `bfloat16` is now the recommended save option (with `float32` retained as a backup). This works for both eager load and lazy load.
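The on-demand idea behind Method 2 (and behind the lazy loading praised in review) can be sketched in plain Python: tensors are materialized only when first requested, so peak memory tracks one tensor or shard rather than the whole checkpoint. The names here (`LazyShardLoader`, the thunks) are illustrative, not the PR's actual API; in the real pipeline the thunks would be `safetensors.safe_open(...).get_tensor(name)` calls.

```python
from typing import Callable, Dict


class LazyShardLoader:
    """Maps tensor names to loader thunks; materializes on first access only."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]):
        self._loaders = loaders
        self._cache: Dict[str, object] = {}

    def get_tensor(self, name: str):
        # Nothing is read until the tensor is actually requested,
        # mirroring safetensors' safe_open().get_tensor() behavior.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]


# Illustrative use: each thunk stands in for a safetensors shard read.
loader = LazyShardLoader({
    "model.embed_tokens.weight": lambda: [[0.0] * 4] * 4,
    "model.layers.0.mlp.weight": lambda: [[1.0] * 4] * 4,
})
tensor = loader.get_tensor("model.embed_tokens.weight")  # loaded on demand
```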
Why It Matters (Impact & Benefits)

- Lower peak memory: `gpt-oss-120b`: 1009.72 GB -> 511.47 GB.
- Faster loading, by avoiding `float32` casting and NumPy bottlenecks: `gpt-oss-120b`: 78 min -> 1 s; `deepseek3-671b`: 7.5 hr -> 4 min.
- Faster end-to-end conversion: `gpt-oss-120b`: 134.86 min -> 96.22 min, a 30% speedup for 120B.
- `deepseek3-671b` previously OOM'd on 3.7 TB RAM; it is now feasible with a peak of 2854.90 GB and a total conversion time of ~9.5 hours.
- Smaller saved checkpoints: `gpt-oss-120b` dropped from 100.17 GiB to 74.23 GiB.
- Loading natively from `safetensors` allows us to handle models without full HuggingFace code support (`deepseek-ai/DeepSeek-V3.2` is still in PR as of 2026-03).
- (`layers.61` is not loaded by `deepseek-ai/DeepSeek-V3`.)

Other changes
- `to_maxtext`: Reuse `HF_MODEL_CONFIGS` rather than `transformers.AutoConfig`. This accommodates models without full HuggingFace code support (e.g., `deepseek3.2`), and also aligns with how `to_huggingface` uses the config.
- `to_huggingface`: Initially, MaxText weights were loaded via `set_decode_state`, which uses `config.weight_dtype`. This was subsequently changed to Orbax restore, which loads the weights as-is. To control the save dtype, we now explicitly cast to `config.weight_dtype` in `utils._process`.
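The explicit save-dtype cast amounts to mapping a cast over every leaf of the restored parameter tree. A minimal NumPy sketch of the idea (MaxText itself works on JAX arrays and reads the target from `config.weight_dtype`; `cast_tree` is a hypothetical helper, and `float16` stands in for `bfloat16`, which NumPy lacks):

```python
import numpy as np


def cast_tree(tree, dtype):
    """Recursively cast every array leaf of a nested dict to `dtype`."""
    if isinstance(tree, dict):
        return {k: cast_tree(v, dtype) for k, v in tree.items()}
    return np.asarray(tree).astype(dtype)


# Orbax restore hands back weights as-is (often float32); casting before
# save is what controls the on-disk checkpoint size.
params = {"decoder": {"kernel": np.ones((2, 2), np.float32)}}
casted = cast_tree(params, np.float16)  # half the bytes per element
```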
Test details in doc.
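The functionality test below sweeps every loading path against every save dtype; the `{lazy, method1, method2} x {bfloat16, float32}` notation expands to six configurations, e.g.:

```python
from itertools import product

# Loader and dtype names as used in this PR's test matrix.
loaders = ["lazy", "method1", "method2"]
save_dtypes = ["bfloat16", "float32"]

# 3 loaders x 2 save dtypes -> 6 conversion configurations to verify.
configs = list(product(loaders, save_dtypes))
```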
1. Performance (`gpt-oss-120b`): `hf-bf16` to MaxText scanned, `bfloat16` save.
2. Functionality (`qwen3-0.6b`): `hf-bf16` to MaxText scanned, {lazy, method1, method2} x {bfloat16, float32}.
3. Scalability (`deepseek3-671b`): {to_maxtext} x {scanned}. `to_maxtext` now works for this model class. (Previously, only `to_huggingface` was feasible due to OOM constraints.)
4. New DeepSeek mappings:
   - `deepseek2-16b`: {to_maxtext, to_huggingface} x {scanned, unscanned}
   - `deepseek3-671b`: {to_maxtext} x {unscanned}
   - `deepseek3.2`: {to_maxtext} x {scanned, unscanned}. Note `to_huggingface` is not enabled, as DeepSeek32ForCausalLM is not supported yet; follow-up in b/496411531.

Examples:
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.