-
Notifications
You must be signed in to change notification settings - Fork 6.7k
Fix nightly CD for GPU builds #18205
Conversation
|
Hey @mseth10 , Thanks for submitting the PR
CI supported jobs: [edge, miscellaneous, unix-cpu, centos-gpu, website, sanity, windows-gpu, unix-gpu, clang, centos-cpu, windows-cpu] Note: |
|
|
||
| set(CUDACXX "/usr/local/cuda-10.0/bin/nvcc" CACHE STRING "Cuda compiler") | ||
| set(MXNET_CUDA_ARCH "3.0;5.0;6.0;7.0;7.5" CACHE STRING "Cuda architectures") | ||
| set(MXNET_CUDA_ARCH "3.0;5.0;6.0;7.0" CACHE STRING "Cuda architectures") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropping support for 7.5 cuda arch is it favorable?
For eg for Tesla T4 [G4 instances] cuda arch supported is 7.5
@leezu What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a temporary fix to get the CD working, I checked the builds for cu100 and cu102, both fail because of binary size issues. We should work on adding back 7.5 arch after making sure the build works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote: 7.0 binaries also run on 7.5
| if platform.system() == 'Darwin': | ||
| shutil.copytree(os.path.join(CURRENT_DIR, 'mxnet-build/3rdparty/mkldnn/build/install/include'), | ||
| os.path.join(CURRENT_DIR, 'mxnet/include/mkldnn')) | ||
| shutil.copytree(os.path.join(CURRENT_DIR, 'mxnet-build/3rdparty/mkldnn/build/install/include'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious how
- previously when nightly CD was working, why was mkldnn include done only for Darwin
- now, to fix nightly CD, this needs to be done for all OS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This platform condition was added by mistake in some earlier commit. We are fixing that here. No reason to package dnnl header files for only Darwin.
|
Also @mseth10 do you mind updating the description [removing extraneous content] and also adding the Jenkins Pipeline build where you tested it out or a command you used to reproduce & test this fix. Thanks |
|
@ChaiBapchya updated the description. Check it out. |
leezu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mseth10
|
Retriggered one previously broken pipeline as http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1048/pipeline |
…he#18205) * use cmake for cd static build, skip running kvstore tests * update dnnl headers stash location * remove unnecessary platform condition * remove 7.5 arch for cu100, cu101, cu102 Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-62.us-west-2.compute.internal>
Description
This PR fixes nightly CD GPU tests by updating the build toolchain to use cmake static build and updating dnnl headers stash location. It removes 7.5 arch for cu100, cu101, cu102 builds to solve oversized libmxnet.so binary issue with cmake builds.
It also fixes the issue with dnnl headers packaging into nightly build artifacts. Fixes #18120
Here's the link to a broken CD pipeline: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1024/pipeline/
Commands used to reproduce and test the fix: