Conversation

@Disty0 (Contributor) commented Jul 7, 2023

What does this PR do?

Add a gpu argument to enable_sequential_cpu_offload and enable_model_cpu_offload instead of assuming the device is always cuda.
We can replace torch.cuda.empty_cache() with another function from the outside, but we can't easily change the target GPU without this PR.
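
For illustration, a minimal sketch of the intended usage after this change (model id and dtype are just examples; the gpu/gpu_id names follow this PR's proposal):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# The offload hooks currently always target "cuda"; with this PR the target
# device type and index can be selected explicitly, e.g. an Intel XPU:
pipe.enable_model_cpu_offload(gpu="xpu", gpu_id=0)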

Before submitting

  • [N] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline?
  • [Y] Did you read our philosophy doc (important for complex PRs)?
  • [N] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [N] Did you make sure to update the documentation with your changes?
    Note: I couldn't find a doc about arguments for cpu offload.
    Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [N] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Core library:

@sayakpaul (Member)

It's quite a large PR. We would maybe like to first discuss the impact and then proceed to reviewing it. Pinging @pcuenca @patrickvonplaten.

@patrickvonplaten (Contributor)

I'm generally fine with it, but what other devices are you targeting exactly? @pcuenca wdyt?

@pcuenca (Member) commented Jul 12, 2023

I tested on mps and it works. However, I'm not sure it helps because memory is unified in M1/M2 computers, so it doesn't matter that we move the model between cpu and the mps device: if the model is too large we'll run out of memory no matter what. I could see two potential benefits, but I think neither can be realized now:

  • Offload to disk instead of cpu, to allow running very large models. I don't think this use case is currently supported by either enable_sequential_cpu_offload or enable_model_cpu_offload (but I might be wrong).
  • Use this on a computer with several mps devices. I'm currently not aware of such a system.

What type of hardware do you have in mind for this, @Disty0? Could you maybe clarify the use case here so we can better understand the issue? Thank you!

@Disty0 (Contributor, Author) commented Jul 12, 2023

I am targeting xpu devices (Intel Arc GPUs) and possibly the dml device (DirectML GPUs) too.

enable_model_cpu_offload works fine on xpu with this PR.
But enable_sequential_cpu_offload doesn't work yet, since PyTorch IPEX doesn't play nicely with the meta device.

@lshqqytiger

enable_sequential_cpu_offload works with DirectML very well. It would be nice if cpu offloading were available for more devices, not only cuda.

@pcuenca (Member) commented Jul 13, 2023

> enable_sequential_cpu_offload works with DirectML very well.

That's interesting! Could you please provide an example of use?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Disty0 (Contributor, Author) commented Jul 14, 2023

> enable_sequential_cpu_offload works with DirectML very well.

> That's interesting! Could you please provide an example of use?

Something along these lines for Intel ARC GPUs:

torch.cuda.empty_cache = torch.xpu.empty_cache
pipeline.enable_model_cpu_offload(gpu="xpu", gpu_id=0)
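
For a more self-contained version of the same idea, a sketch assuming intel_extension_for_pytorch is installed (it is what provides the torch.xpu namespace for Arc GPUs), with pipeline being any loaded diffusers pipeline:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers torch.xpu

# diffusers calls torch.cuda.empty_cache() internally during offloading, so it
# is swapped for the XPU equivalent from the outside, as noted above.
torch.cuda.empty_cache = torch.xpu.empty_cache

# Target the first Intel Arc GPU instead of the default cuda device.
pipeline.enable_model_cpu_offload(gpu="xpu", gpu_id=0)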

@pcuenca (Member) commented Jul 15, 2023

Interesting! So I would be in favor of reviewing and merging this; it sounds like it could be useful for some hardware architectures.

@lshqqytiger

> enable_sequential_cpu_offload works with DirectML very well.

> That's interesting! Could you please provide an example of use?

Similar to xpu.

pipeline.enable_sequential_cpu_offload(gpu="privateuseone", gpu_id=0)

But I would prefer to provide a torch.device itself, like below.

device = torch_directml.device(0)
pipeline.enable_sequential_cpu_offload(device=device)
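
For completeness, the same thing with the import included (a sketch assuming the torch-directml package; the device= keyword follows the interface proposed in #4114 rather than this PR's gpu/gpu_id pair):

import torch_directml

# torch_directml exposes DirectML GPUs through PyTorch's 'privateuseone' backend.
device = torch_directml.device(0)
pipeline.enable_sequential_cpu_offload(device=device)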

@patrickvonplaten (Contributor)

Agree with @lshqqytiger here - that's also what I added here: #4114. Should we maybe first go with #4114 and then possibly refactor enable_model_cpu_offload in a similar way?
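
For reference, a rough sketch of the device-based call shape being discussed (illustrative only; #4114 is the authoritative change for enable_sequential_cpu_offload):

import torch

# Accept a torch.device (or a device string) instead of a separate gpu/gpu_id pair.
pipeline.enable_sequential_cpu_offload(device=torch.device("cuda", 0))
pipeline.enable_sequential_cpu_offload(device="xpu")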

@Disty0 (Contributor, Author) commented Jul 17, 2023

> Agree with @lshqqytiger here - that's also what I added here: #4114. Should we maybe first go with #4114 and then possibly refactor enable_model_cpu_offload in a similar way?

Using device would be better; I agree with that too.

#4114 seems like a better approach than this PR. We can close this one, continue with #4114, and refactor enable_model_cpu_offload after that.

@patrickvonplaten (Contributor) commented Jul 18, 2023

Sorry for the duplicated work here @Disty0. Should we maybe try to do the same refactor for enable_model_cpu_offload by making every class define a class attribute describing the order in which the hooks should be applied? cc @williamberman @pcuenca wdyt?

@Disty0 (Contributor, Author) commented Jul 18, 2023

> Sorry for the duplicated work here @Disty0. Should we maybe try to do the same refactor for enable_model_cpu_offload by making every class define a class attribute describing the order in which the hooks should be applied? cc @williamberman @pcuenca wdyt?

Yes, that would be better. That way we won't have to change 68+ files if we want to change something in the future.
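
As a hypothetical illustration of that class-attribute idea (names are made up here, not the final API), each pipeline could declare its hook order once and a single generic implementation could consume it:

import torch


class DiffusionPipelineBase:
    # Each concrete pipeline declares the order in which its sub-models should
    # receive offload hooks, instead of reimplementing enable_model_cpu_offload
    # in 68+ pipeline files.
    model_cpu_offload_seq: str = ""

    def enable_model_cpu_offload(self, gpu_id: int = 0, device: str = "cuda") -> None:
        execution_device = torch.device(device, gpu_id)
        for name in self.model_cpu_offload_seq.split("->"):
            component = getattr(self, name)
            # The real implementation would attach an accelerate cpu-offload
            # hook here; this sketch only walks the declared order.
            _ = (component, execution_device)


class ExamplePipeline(DiffusionPipelineBase):
    model_cpu_offload_seq = "text_encoder->unet->vae"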
