This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@ptrendx
Member

@ptrendx ptrendx commented Oct 16, 2020

Description

Starting with CUDA 11.1, programs compiled with a newer CUDA toolkit can run on an older driver (as long as the major version is the same, e.g. CUDA 11.1 works with a CUDA 11.0 driver) without the compat library. This does require a few changes to how the NVRTC API is used, which this PR addresses.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • Code is well-documented

Comments

  • Change was tested in internal CI using CUDA 11.1 toolkit and Titan RTX with 450.80.02 driver.

@ptrendx ptrendx requested a review from DickJC123 October 16, 2020 22:41
@mxnet-bot

Hey @ptrendx , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [sanity, windows-cpu, centos-gpu, miscellaneous, website, unix-gpu, edge, windows-gpu, centos-cpu, clang, unix-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added the pr-awaiting-testing PR is reviewed and waiting CI build and test label Oct 16, 2020
@ptrendx
Member Author

ptrendx commented Oct 16, 2020

Note: this does not touch the legacy RTC part (https://mxnet.apache.org/versions/1.6/api/python/docs/api/mxnet/rtc/index.html) - what is the plan for it @szha?

@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020
@szha
Member

szha commented Oct 17, 2020

I think we need to continue to support mx.rtc

@ptrendx
Member Author

ptrendx commented Oct 18, 2020

Are there any people using it? The interface is not great, since the CudaModule from there is not even an operator, so it can't be used in a model. We could reuse the recent RTC stuff to make it a much better experience (and actually potentially pretty useful).

That said, this PR does not touch that functionality (because the compilation options there are set by the user). I could make it so that if you specify the proper option yourself (--gpu-architecture=sm_XX instead of compute_XX), it gets the cubin instead of PTX, so it works with enhanced compatibility.
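To illustrate the idea in the comment above, here is a minimal sketch of how user-supplied compilation options could be inspected to decide between cubin and PTX. The helper name `RequestsCubin` is hypothetical and not taken from the MXNet source; the only thing assumed from the discussion is that `--gpu-architecture=sm_XX` names a real architecture while `--gpu-architecture=compute_XX` names a virtual one.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the MXNet code base): returns true when any
// user-supplied option requests a real GPU architecture (sm_XX), in which
// case the caller would ask the compiler for a cubin rather than PTX.
bool RequestsCubin(const std::vector<std::string>& options) {
  const std::string prefix = "--gpu-architecture=";
  for (const std::string& opt : options) {
    // Skip options that are not an architecture selection.
    if (opt.compare(0, prefix.size(), prefix) != 0) continue;
    const std::string arch = opt.substr(prefix.size());
    // "sm_XX" is a real architecture; "compute_XX" is a virtual one.
    if (arch.compare(0, 3, "sm_") == 0) return true;
  }
  return false;
}
```

With this sketch, `{"--gpu-architecture=sm_70"}` would select the cubin path, while `{"--gpu-architecture=compute_70"}` or an option list without an architecture flag would keep the PTX path.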

@szha
Member

szha commented Oct 18, 2020

Agreed. In 2.0 we can change the interface.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 19, 2020
@ptrendx
Member Author

ptrendx commented Oct 20, 2020

@mxnet-bot run ci [centos-cpu, centos-gpu, edge, miscellaneous]

@mxnet-bot

Jenkins CI successfully triggered : [edge, centos-cpu, miscellaneous, centos-gpu]

@ptrendx
Member Author

ptrendx commented Oct 23, 2020

@mxnet-bot run ci [centos-cpu, unix-gpu, edge, website]

@mxnet-bot

Jenkins CI successfully triggered : [edge, website, unix-gpu, centos-cpu]

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 23, 2020
Contributor

@DickJC123 DickJC123 left a comment


For others interested in better understanding the motivation behind this PR, I suggest https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf. One paragraph worth repeating from that doc is:

To use other CUDA APIs introduced in a minor release (that require a new
driver), one would have to implement fallbacks or fail gracefully. This situation
is not different from what is available today where developers use macros to
compile out features based on CUDA versions. Users should refer to the CUDA
headers and documentation for new CUDA APIs introduced in a release.

Thus, it's fair to use an 11.1 feature that is supported by both 11.1 and 11.0 kernel-mode drivers. Before using an 11.1 feature that requires an 11.1 kernel-mode driver, one should check dynamically for that feature's presence at runtime, as suggested in section 3.2 of the document, "Handling New CUDA Features." This is particularly important to keep in mind while the upstream CI testing has no enhanced-compatibility build.
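A runtime check of this kind could be sketched roughly as follows. CUDA encodes versions as 1000 * major + 10 * minor (so 11.1 is 11010); in real code the driver value would come from a call such as `cudaDriverGetVersion`, which is left out here so the sketch stays self-contained. The names `DecodeCudaVersion`, `FeatureAvailable`, and `ChooseCodePath` are hypothetical, and a plain version comparison is only one possible way to detect a feature.

```cpp
#include <string>

// Decoded form of a CUDA version integer (1000 * major + 10 * minor).
struct CudaVersion {
  int major_ver;
  int minor_ver;
};

// Splits an encoded version such as 11010 into {11, 1}.
CudaVersion DecodeCudaVersion(int encoded) {
  return {encoded / 1000, (encoded % 1000) / 10};
}

// Hypothetical gate: the feature is treated as usable only when the driver
// reports at least the version that introduced it.
bool FeatureAvailable(int driver_version, int required_version) {
  return driver_version >= required_version;
}

// Picks the new code path when the driver supports it and the fallback
// otherwise, instead of failing at the point of use.
std::string ChooseCodePath(int driver_version, int required_version) {
  return FeatureAvailable(driver_version, required_version) ? "new-feature"
                                                            : "fallback";
}
```

For example, a driver reporting 11000 with a feature that requires 11010 would take the fallback path, which matches the "implement fallbacks or fail gracefully" guidance quoted above.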

Comment on lines +85 to +86
const auto getSize = use_cubin ? nvrtcGetCUBINSize : nvrtcGetPTXSize;
const auto getFunc = use_cubin ? nvrtcGetCUBIN : nvrtcGetPTX;
Contributor


FWIW, while nvrtcGetCUBINSize() and nvrtcGetCUBIN() are not yet in the nvrtc docs, their use is described in https://docs.nvidia.com/deploy/cuda-compatibility/
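The selection pattern in the two reviewed lines works because the size getters share one signature and the data getters share another, so a conditional expression can pick between them. The sketch below reproduces that pattern with stand-in functions so it compiles without the CUDA toolkit; `FakeProgram` and the `Fake*` getters are hypothetical placeholders, not the NVRTC API, and they simplify details such as the trailing NUL that NVRTC includes in the PTX size.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a compiled program; a placeholder, not an NVRTC type.
struct FakeProgram {
  std::string ptx;
  std::string cubin;
};

// Stand-in getters that mirror the "query size, then copy" call shape.
// Each returns 0 on success.
int FakeGetPTXSize(const FakeProgram* prog, std::size_t* size) {
  *size = prog->ptx.size();
  return 0;
}
int FakeGetCUBINSize(const FakeProgram* prog, std::size_t* size) {
  *size = prog->cubin.size();
  return 0;
}
int FakeGetPTX(const FakeProgram* prog, char* out) {
  std::memcpy(out, prog->ptx.data(), prog->ptx.size());
  return 0;
}
int FakeGetCUBIN(const FakeProgram* prog, char* out) {
  std::memcpy(out, prog->cubin.data(), prog->cubin.size());
  return 0;
}

// Selects the getter pair once, then runs the same retrieval code for both
// output kinds. Returns an empty string if either call reports an error.
std::string GetCompiledCode(const FakeProgram* prog, bool use_cubin) {
  const auto getSize = use_cubin ? FakeGetCUBINSize : FakeGetPTXSize;
  const auto getFunc = use_cubin ? FakeGetCUBIN : FakeGetPTX;
  std::size_t size = 0;
  if (getSize(prog, &size) != 0) return "";
  std::string code(size, '\0');
  if (getFunc(prog, &code[0]) != 0) return "";
  return code;
}
```

Choosing the function pointers up front keeps the size query and the copy in one code path, so the cubin and PTX cases cannot drift apart.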

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Nov 2, 2020
@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Nov 2, 2020
@ptrendx
Member Author

ptrendx commented Nov 3, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Nov 3, 2020
@DickJC123 DickJC123 merged commit b33fbd1 into apache:master Nov 3, 2020
vidyaravipati pushed a commit to vidyaravipati/incubator-mxnet that referenced this pull request Nov 11, 2020
* Guard RTC better

* Use nvrtcGetCUBIN

* Fix lint

* Enable cubin loading in legacy rtc path

* Fixes from review