Add GDS support for safetensors loading #45113
cyyever wants to merge 1 commit into huggingface:main
Conversation
Force-pushed from 430ea29 to e38eb51.
Hi, thanks for the contribution! We're working on these kinds of topics over in the https://github.com/huggingface/safetensors repo directly; it would probably be best to have GDS support in the library itself rather than in transformers, cc @ArthurZucker.
Would be curious to see the following:
- larger model load
- distributed loading
- running iostat during load to measure throughput
- fio theoretical max throughput on your machine, for reference
- warm vs cold cache test (sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches)
- OS: I assume Linux here?
- I assume each issued read is not sequential, since tensor ranges may overlap as you're pulling full tensor data for each slice, IIUC; I wonder how much that impacts reads in the context of GDS though. Would be cool if you could test that, but I think it'll require a more involved setup.
Not sure this will scale well with larger models, lmk if you want to test some more after my feedback. IMO not worth it as we're going to tackle this in the coming weeks in a "specialised manner" over in https://github.com/mfuntowicz/hmll, cc @mfuntowicz
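As a starting point for the throughput numbers requested above, here is a minimal, self-contained timing sketch. The file it reads is a dummy stand-in; in a real benchmark you would point this at the safetensors shards (or wrap the `from_pretrained` call itself) and run it once cold, after the drop_caches command above, and once warm.

```python
import os
import tempfile
import time


def measure_read_throughput(path, block_size=1 << 20):
    """Read a file sequentially in block_size chunks and return MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1 << 20) / elapsed


if __name__ == "__main__":
    # Illustrative stand-in for a model shard; a real test would read the
    # actual safetensors files, cold (after drop_caches) and then warm.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(16 * (1 << 20)))  # 16 MiB dummy "shard"
        path = f.name
    try:
        print(f"{measure_read_throughput(path):.1f} MiB/s")
    finally:
        os.unlink(path)
```

Comparing the cold number against the fio ceiling is what tells you whether the GDS path is actually saturating the device.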
```python
        return self._shape

    def __getitem__(self, slices):
        tensor = self._gds_file.get_tensor(self._name, self._target_device)
```
Does this mean you pull the full tensor for each slice that is requested? What does your memory footprint on device look like once the model is loaded?
I'm not sure I understand your question. The purpose is to load tensors via the GDS API, which works best with aligned file offsets.
How do you guarantee file offsets are aligned here? safetensors files aren't written with that constraint in mind, you need to do some extra processing (we're thinking of supporting writing aligned offsets, but it's tricky wrt backwards compatibility).
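For context on the alignment point: GDS/O_DIRECT-style reads typically want offsets on a 4 KiB boundary, while safetensors writes tensors back to back at arbitrary offsets. One common workaround (a sketch, not this PR's actual code, and the 4096 constant is an assumption) is to round the requested offset down to an alignment boundary, issue a padded aligned read, and slice off the slack:

```python
ALIGNMENT = 4096  # assumed GDS/O_DIRECT alignment requirement


def aligned_range(offset, length, alignment=ALIGNMENT):
    """Return (aligned_offset, padded_length, head_slack) describing an
    alignment-respecting read that covers [offset, offset + length)."""
    aligned_offset = (offset // alignment) * alignment
    head_slack = offset - aligned_offset
    padded_length = head_slack + length
    # Round the read length up to a multiple of the alignment as well.
    padded_length = ((padded_length + alignment - 1) // alignment) * alignment
    return aligned_offset, padded_length, head_slack


# Example: a tensor stored at byte offset 5000 with 10000 bytes of data.
off, n, slack = aligned_range(5000, 10000)
print(off, n, slack)  # → 4096 12288 904
# The caller issues one aligned read of n bytes at off, then uses
# buffer[slack : slack + 10000] as the tensor's payload.
```

This trades a small amount of over-read per tensor for alignment, without requiring the file itself to be written with aligned offsets.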
What I'm asking is that, from your implementation, it seems you're loading the full tensor self._name on each call to GdsSlice.__getitem__. That is why I asked what the memory footprint (total used memory on device) looks like. If you can run nvidia-smi after loading the model, that'd be a good test to see whether that happens.
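If the full-tensor-per-slice reading is indeed what happens, one cheap mitigation is to materialise the tensor once and serve subsequent slices from the cached copy. A minimal sketch, where `load_fn` is a stand-in for the GDS-backed read (class and function names here are hypothetical, not from the PR):

```python
class CachingSlice:
    """Slice wrapper that hits storage once and caches the result."""

    def __init__(self, load_fn):
        self._load_fn = load_fn  # stand-in for the GdsFile-backed read
        self._tensor = None

    def __getitem__(self, slices):
        if self._tensor is None:  # hit storage only on first access
            self._tensor = self._load_fn()
        return self._tensor[slices]


# Demonstration with a counting stand-in for the GDS read.
calls = 0


def fake_load():
    global calls
    calls += 1
    return list(range(10))


s = CachingSlice(fake_load)
s[slice(0, 3)]
s[slice(3, 6)]
print(calls)  # → 1: the backing read happened once, not once per slice
```

The trade-off is that the cached tensor stays resident until the wrapper is dropped, which is exactly the device-memory-footprint question raised above.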
@McPatate I saw a similar PR in the safetensors repo, but unfortunately it was denied. I prefer to apply GDS in transformers because it provides a global overview of IO bottlenecks in large-scale LLM training/inference scenarios. For your concerns:
By larger I meant larger than Qwen 7B! But I assume we're entering distributed territory after that size, so consider these two points as the same. I would appreciate a benchmark to see how your implementation performs in that scenario.
No way to run your code on an isolated machine?
I was asking if you could run iostat during the load.
There are scenarios where cold cache runs make sense (e.g. loading a model after a restart with files already present on disk).
I'm not convinced you are issuing reads sequentially, as for each slice you read the full tensor, again IIUC.
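One way to quantify, and largely avoid, the overlapping-read concern is to coalesce the per-tensor byte ranges before issuing reads, so each byte is fetched once and scattered reads collapse into fewer sequential ones. A sketch (the helper name is hypothetical):

```python
def coalesce(ranges):
    """Merge overlapping or adjacent (start, end) byte ranges.

    Sorting then merging turns scattered per-tensor reads into the
    minimal set of sequential reads covering the same bytes.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two overlapping tensor ranges plus a disjoint one:
print(coalesce([(0, 100), (50, 150), (200, 300)]))  # → [(0, 150), (200, 300)]
```

Comparing the total bytes in the raw ranges against the coalesced total would directly measure how much redundant IO the per-slice full-tensor reads cause.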
What does this PR do?
This PR adds GPU Direct Storage (GDS) support for safetensors model loading via torch.cuda.gds.GdsFile. GDS is disabled by default; set the HF_ENABLE_GDS=1 environment variable to enable it.

Benchmark

A100 PCIe 40GB, Samsung NVMe 3.5TB, GDS compat mode (no nvidia-fs) in from_pretrained:

Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.