speedup scan_nlist kernel #1028
Merged
amcadmus merged 2 commits into deepmodeling:devel on Aug 25, 2021
Codecov Report
@@            Coverage Diff             @@
##            devel    #1028      +/-   ##
==========================================
- Coverage   82.86%   75.01%   -7.86%
==========================================
  Files         119       86      -33
  Lines       10110     6924    -3186
==========================================
- Hits         8378     5194    -3184
+ Misses       1732     1730       -2
Member
@pkulzy could you plz check if this PR works fine on ROCm? Thanks!
Contributor
OK. These changes have passed the UTs. As for the specific speedup, I need to wait until the cluster environment is in better shape to measure it.
galeselee approved these changes Aug 25, 2021
amcadmus approved these changes Aug 25, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
* speedup cuda kernel scan_nlist
* fix no-pbc error
Our profiling of the example water benchmark system shows that the scan_nlist kernel in $deepmd_source_dir/source/lib/src/cuda/neighbor_list.cu consumes more than 7% of the kernel execution time during the dp train process, and more than 20% of the kernel execution time during the dp init-frz-model process.

The original scan_nlist kernel uses one thread to scan the neighbor list of each central atom. This is inefficient during training: because the training nloc is usually smaller than the number of threads per CUDA block, scan_nlist typically launches only one CUDA thread block at each training step, which wastes a large amount of computing resources.

In the new implementation, I use the NVIDIA/cub header library to parallelize the scan kernel, which speeds up the scan_nlist kernel by more than 50 times. The total training speed-up is about 3%, mainly because a considerable part of the training process still does not run in GPU kernel functions. We still need to perform more detailed profiling of the training process.

The lcurve.out of CPU training (lcurve-cpu.out), GPU training (lcurve-gpu.out), and training with the new scan (lcurve-parallel.out) show the same results.
This confirms the correctness of the new implementation. All UTs have passed on my local V100 workstation.
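
For readers unfamiliar with the scan step, here is a minimal CPU-side sketch of the idea, not the kernel code from this PR. It assumes that empty slots in a neighbor-list row are marked with -1; the names kEmpty, scan_row_serial, and scan_row_prefix are hypothetical. The first function numbers the valid entries with a serial walk, which is roughly what one thread per atom does. The second expresses the same result as an exclusive prefix sum over validity flags, which is the form a block-wide parallel scan (such as the one cub provides) can compute cooperatively across the threads of a block.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Assumption for this sketch: an empty neighbor slot is marked with -1.
constexpr int kEmpty = -1;

// Serial walk over one row: each valid slot gets the count of valid
// slots before it. Returns the number of neighbors in the row.
// order[i] is written only where row[i] is valid.
int scan_row_serial(const std::vector<int>& row, std::vector<int>& order) {
    assert(order.size() == row.size());
    int count = 0;
    for (std::size_t i = 0; i < row.size(); ++i) {
        if (row[i] != kEmpty) {
            order[i] = count++;
        }
    }
    return count;
}

// Same result written as a prefix sum: flag valid slots with 1, take an
// inclusive running sum, and subtract the flag to get the exclusive sum.
// On a GPU, the running sum is the part a block-wide scan parallelizes.
int scan_row_prefix(const std::vector<int>& row, std::vector<int>& order) {
    assert(order.size() == row.size());
    std::vector<int> flags(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) {
        flags[i] = (row[i] != kEmpty) ? 1 : 0;
    }
    std::vector<int> inclusive(row.size());
    std::partial_sum(flags.begin(), flags.end(), inclusive.begin());
    for (std::size_t i = 0; i < row.size(); ++i) {
        if (flags[i] == 1) {
            order[i] = inclusive[i] - flags[i];  // exclusive prefix sum
        }
    }
    return inclusive.empty() ? 0 : inclusive.back();
}
```

Both functions should return the same count and write the same order values for any row, so comparing their outputs is a cheap consistency check when experimenting with a parallel version.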