[docs] Update shard size #42749
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cyrilvallez
left a comment
Nice, thanks for being quick on this!! It's indeed a very important aspect!
[`~PreTrainedModel.save_pretrained`] automatically shards checkpoints larger than 50GB. This keeps shard counts low for large models and simplifies file management without significantly slowing load times.

- Each shard is loaded sequentially after the previous shard is loaded, limiting memory usage to only the model size and the largest shard size.
+ Shards load sequentially and memory usage is limited to the model size plus the largest shard. Set `max_shard_size` in [`~PreTrainedModel.save_pretrained`] to control the threshold.
This is not true anymore!
Parameters are now loaded in parallel by default (this can be deactivated by setting the `HF_DEACTIVATE_ASYNC_LOAD` env variable), and memory usage is strictly constrained to the model size, EXCEPT if the model needs on-the-fly weight conversions (i.e. most MoE models). In that case the memory peak is `model_size + largest_params_needed_in_a_single_conversion`, i.e. for MoE models, `model_size + experts_on_one_layer`.
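To make the rule of thumb concrete, here is a minimal sketch of the peak-memory arithmetic described in this comment. The function name, argument names, and byte counts are hypothetical and chosen only for illustration; this is not code from the library.

```python
GB = 10**9


def estimate_peak_load_memory(model_size: int, largest_conversion_size: int = 0) -> int:
    """Rough peak memory (bytes) while loading, following the comment's rule of thumb.

    model_size: total size of the model weights.
    largest_conversion_size: the largest set of parameters needed for a single
        on-the-fly conversion (for MoE models, the experts of one layer).
        Leave at 0 for models that need no conversion.
    """
    return model_size + largest_conversion_size


# Hypothetical dense model: no conversion, so the peak equals the model size.
dense_peak = estimate_peak_load_memory(14 * GB)

# Hypothetical MoE model: the peak adds the experts of a single layer.
moe_peak = estimate_peak_load_memory(90 * GB, largest_conversion_size=3 * GB)

print(dense_peak // GB, moe_peak // GB)
```

Note that the shard size does not appear anywhere in this estimate, which is why the sentence about "the largest shard" no longer applies.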
Cyrilvallez
left a comment
Nice! Thanks a lot! Just left a final comment
[`~PreTrainedModel.save_pretrained`] automatically shards checkpoints larger than 50GB. This keeps shard counts low for large models and simplifies file management.

- Each shard is loaded sequentially after the previous shard is loaded, limiting memory usage to only the model size and the largest shard size.
+ Parameters load in parallel and peak memory only depends on model size. Set `max_shard_size` in [`~PreTrainedModel.save_pretrained`] to control the threshold.
What threshold are we talking about here?
"Threshold" refers to the maximum checkpoint size before sharding. I updated it so it's clearer what "threshold" is :)
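As an illustration of what a size threshold means for sharding, here is a simplified sketch that greedily groups parameters into shards no larger than a given limit. It is an assumption-laden toy, not the implementation behind `save_pretrained`; the `plan_shards` helper and the parameter sizes are hypothetical.

```python
from typing import Dict, List

GB = 10**9


def plan_shards(param_sizes: Dict[str, int], max_shard_size: int) -> List[List[str]]:
    """Greedily group parameter names into shards of at most max_shard_size bytes.

    A single parameter larger than the threshold ends up in a shard of its own.
    """
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in param_sizes.items():
        # Close the current shard if adding this parameter would exceed the threshold.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Hypothetical parameter sizes, chosen only for illustration (90GB in total).
sizes = {"embed": 20 * GB, "layer_0": 25 * GB, "layer_1": 25 * GB, "lm_head": 20 * GB}

print(plan_shards(sizes, 50 * GB))   # checkpoint exceeds the threshold, so it is split
print(plan_shards(sizes, 100 * GB))  # checkpoint fits under the threshold, one shard
```

In Transformers itself, the threshold is set through the `max_shard_size` argument of `save_pretrained`, as the updated doc sentence says.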
* shard size
* feedback
* add link
* clarify
* update link
Updates docs to reflect increased shard size in #42734