synchronize in the beginning of all CUDA functions #2661
Merged
wanghan-iapcm merged 1 commit into deepmodeling:devel on Jul 9, 2023
Conversation
Fix deepmodeling#2660. TensorFlow made streams non-blocking in tensorflow/tensorflow@9d12620. Our own CUDA functions use the default stream, which is different from TensorFlow's, so we need to synchronize at the beginning of all functions. In the future, it might be worth using the same stream as TensorFlow's to improve performance.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
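For illustration, here is a minimal sketch of the pattern this PR applies. The function and kernel names below are hypothetical; the real change adds a synchronization call at the top of each CUDA launcher in `source/lib/src/cuda`:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
template <typename FPTYPE>
__global__ void example_kernel(FPTYPE* out, const FPTYPE* in, const int size) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < size) {
    out[idx] = in[idx] * in[idx];
  }
}

// Hypothetical launcher showing the pattern: TensorFlow now enqueues its
// work on a non-blocking stream, while this launcher uses the default
// stream, so synchronize first to make sure TensorFlow's writes to `in`
// have completed before our kernel reads them.
template <typename FPTYPE>
void example_op_gpu_cuda(FPTYPE* out, const FPTYPE* in, const int size) {
  cudaDeviceSynchronize();  // wait for TensorFlow's stream to finish
  const int block = 256;
  const int grid = (size + block - 1) / block;
  example_kernel<<<grid, block>>>(out, in, size);
  cudaDeviceSynchronize();                     // wait for our kernel
  const cudaError_t err = cudaGetLastError();  // surface launch errors
  if (err != cudaSuccess) {
    // real code would report cudaGetErrorString(err) and abort
  }
}
```

`cudaDeviceSynchronize()` is a blunt instrument, since it waits on every stream; as the description notes, sharing TensorFlow's stream would avoid the stall entirely.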
Codecov Report: Patch coverage has no change and project coverage change: -0.01%.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##            devel    #2661      +/-   ##
==========================================
- Coverage   78.35%   78.35%   -0.01%
==========================================
  Files         235      235
  Lines       24473    24473
  Branches     1469     1469
==========================================
- Hits        19177    19176       -1
- Misses       4906     4907       +1
  Partials      390      390
```
wanghan-iapcm approved these changes on Jul 9, 2023
wanghan-iapcm pushed a commit that referenced this pull request on Sep 22, 2023:
Merge `source/lib/src/cuda` and `source/lib/src/rocm` into `source/lib/src/gpu`.

- Define macros `gpuGetLastError`, `gpuDeviceSynchronize`, `gpuMemcpy`, `gpuMemcpyDeviceToHost`, `gpuMemcpyHostToDevice`, and `gpuMemset` to make them available for both CUDA and ROCm (a hedged sketch of such a layer follows below).
- Use `<<< >>>` syntax for both CUDA and ROCm. Per ROCm/hip@cf78d85, it has been supported in HIP since 2018.
- Fix several int constants that should be double or float.
- For tabulate:
  - Fix `WARP_SIZE` for ROCm. Per pytorch/pytorch#64302, `WARP_SIZE` can be 32 or 64, so it should not be hardcoded to 64.
  - Add `GpuShuffleSync`. Per ROCm/hip#1491, `__shfl_sync` is not supported by HIP.
  - After merging the code, #1274 should also work for ROCm.
  - Use the same `ii` for #830 and #2357. Although both of them work, `ii` has different meanings in these two PRs, but now it should be the same.
  - However, `ii` in `tabulate_fusion_se_a_fifth_order_polynomial` (ROCm) added by #2532 is wrong. After merging the code, it should be corrected.
  - The optimization in #830 was not applied to ROCm.
  - `__syncwarp` is not supported by ROCm.
- After merging the code, #2661 will be applied to ROCm. Although the TF ROCm stream is still blocking (https://github.com/tensorflow/tensorflow/blob/9d1262082e761cd85d6726bcbdfdef331d6d72c6/tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc#L566), we don't know whether it will change to non-blocking.
- There are several other differences between CUDA and ROCm.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
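For context, a hedged sketch of what such a CUDA/ROCm compatibility layer might look like. The macro names match the commit message above, but the header name, exact layout, and the `GpuShuffleSync` signature are assumptions, not the actual contents of `source/lib/src/gpu`:

```cpp
// gpu_compat.h -- illustrative sketch only; the real header in
// source/lib/src/gpu may be organized differently.
#pragma once

#ifdef __HIPCC__  // building with the ROCm/HIP toolchain
#include <hip/hip_runtime.h>
#define gpuGetLastError hipGetLastError
#define gpuDeviceSynchronize hipDeviceSynchronize
#define gpuMemcpy hipMemcpy
#define gpuMemcpyDeviceToHost hipMemcpyDeviceToHost
#define gpuMemcpyHostToDevice hipMemcpyHostToDevice
#define gpuMemset hipMemset
#else  // plain CUDA
#include <cuda_runtime.h>
#define gpuGetLastError cudaGetLastError
#define gpuDeviceSynchronize cudaDeviceSynchronize
#define gpuMemcpy cudaMemcpy
#define gpuMemcpyDeviceToHost cudaMemcpyDeviceToHost
#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
#define gpuMemset cudaMemset
#endif

// Portable warp shuffle: HIP does not implement __shfl_sync (ROCm/hip#1491),
// so fall back to __shfl there; the mask argument has no HIP counterpart
// and is ignored.
template <typename T>
__device__ inline T GpuShuffleSync(unsigned mask, T val, int src_lane) {
#ifdef __HIPCC__
  (void)mask;
  return __shfl(val, src_lane);
#else
  return __shfl_sync(mask, val, src_lane);
#endif
}
```

Callers can then write, e.g., `gpuMemcpy(dst, src, n, gpuMemcpyDeviceToHost)` once and compile it with either toolchain.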