fix bug when getting the real accelerator's device number #2874

Closed

faaany wants to merge 3 commits into huggingface:main from faaany:device-fix

Conversation

@faaany
Contributor

@faaany faaany commented Jun 20, 2024

What does this PR do?

This PR is a follow-up fix for PR #2826. I want to correct my statement in that PR that torch.device(d).type == "xpu" is enough to check for an XPU device, just like the npu and mlu cases. That was my mistake. In fact, torch.device(0).type will always return "cuda" on XPU, as can be seen from the PyTorch code and the official PyTorch doc, at least for now. But we are working on a PR to support this in a future PyTorch version. Also, for the NPU path, I think torch.device(0).type will return cuda, as can be seen here.
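
A minimal sketch of the pitfall, assuming a stock PyTorch build from before the fix discussed here (the snippet only constructs device objects, so no accelerator is needed to run it):

>>> import torch
>>> torch.device(0).type   # a bare integer index defaults to the cuda type
'cuda'
>>> torch.device(0).type == "xpu"   # so this check can never match an int id
False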

In addition, users might pass a device id that exceeds the available device count. In that case, the invalid id should not count toward num_devices when calculating the balanced memory. So this PR actually fixes two issues (sketched in the snippet after this list):

  • num_devices for non-CUDA devices is always 0
  • num_devices includes device indices larger than the available device count
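
To make the two fixes concrete, here is a hypothetical sketch; count_expected_devices, max_memory, and device_count are illustrative names, not the actual accelerate code:

import torch

def count_expected_devices(max_memory, expected_type, device_count):
    # max_memory maps device ids (ints or strings like "xpu:0") to memory budgets
    num_devices = 0
    for d in max_memory:
        if isinstance(d, int):
            # fix 1: treat a bare int key as the real accelerator, not "cuda"
            dev_type, index = expected_type, d
        else:
            dev = torch.device(d)
            dev_type, index = dev.type, dev.index
        # fix 2: skip ids that exceed the number of devices actually present
        if dev_type == expected_type and index is not None and index < device_count:
            num_devices += 1
    return num_devices

# with two real XPUs, the out-of-range id 5 and the cpu entry are not counted:
count_expected_devices({0: "10GiB", 1: "10GiB", 5: "10GiB", "cpu": "30GiB"},
                       expected_type="xpu", device_count=2)   # -> 2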

Who can review?

@SunMarc and @muellerzr

@yao-matrix
Contributor

OK for me.

@faaany faaany marked this pull request as ready for review June 20, 2024 09:34
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@faaany
Contributor Author

faaany commented Jun 26, 2024

@SunMarc @muellerzr

Member

@SunMarc SunMarc left a comment


Thanks for your work @faaany! I left a comment.

Comment thread: src/accelerate/utils/modeling.py (Outdated)
Member

@SunMarc SunMarc left a comment


Sounds good! I left a comment about mlu since I'm not sure it's safe to change the expected_device_type.

Comment on lines -994 to -995
elif is_mlu_available():
expected_device_type = "mlu"
Member


The issue you shared, @faaany, is about the to() method for mlu. I'm not sure torch.device(d).type will really return cuda with mlu. Do you have any insights, @huismiling, since you added the mlu support? To be safe, I would suggest reverting this change.
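
For context, the selection under discussion looks roughly like this; a sketch assuming accelerate's is_npu_available/is_mlu_available/is_xpu_available helpers, not the verbatim modeling.py source:

from accelerate.utils import is_mlu_available, is_npu_available, is_xpu_available

# each backend advertises its own expected device type; dropping the mlu
# branch is only safe if torch.device(0).type already reports "mlu" there
expected_device_type = "cuda"
if is_npu_available():
    expected_device_type = "npu"
elif is_mlu_available():
    expected_device_type = "mlu"
elif is_xpu_available():
    expected_device_type = "xpu"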

Contributor Author

@faaany faaany Jun 27, 2024


Good idea, I will update it. But I am also very curious about the behavior on NPU:

Hi @statelesshz, could you help us verify what torch.device(0) returns on NPU? Is it cuda or npu? Thanks a lot!

Contributor


@faaany Hi, the MLU device type is mlu.

>>> torch.device(0).type
'mlu'

Member


Thanks @huismiling! Could you update the PR, @faaany?

Contributor Author


@huismiling thanks for updating!

@faaany
Contributor Author

faaany commented Jul 3, 2024

The good news is that in the meantime our proposal to make torch.device(0) return 'xpu' on XPU got approved by stock PyTorch (PR link). To avoid "reverse-engineering", let me close this PR. Thanks so much for the discussion! @SunMarc

@faaany faaany closed this Jul 3, 2024
@muellerzr
Contributor

That's great, @faaany!

@SunMarc
Member

SunMarc commented Jul 3, 2024

Nice :)
