speedup scan_nlist kernel by denghuilu · Pull Request #1028 · deepmodeling/deepmd-kit

denghuilu · 2021-08-25T03:37:49Z

Profiling the example water benchmark system shows that the scan_nlist kernel in $deepmd_source_dir/source/lib/src/cuda/neighbor_list.cu accounts for more than 7% of kernel execution time during dp train, and for more than 20% of kernel execution time during dp init-frz-model.

The original scan_nlist kernel uses one thread to scan the neighbor list of each central atom. This is inefficient during training: because the training nloc is usually smaller than the number of threads per CUDA block, scan_nlist typically launches only one CUDA thread block per training step, which leaves most of the GPU's computing resources idle.
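To make the launch arithmetic concrete, here is a minimal Python model of the one-thread-per-atom scheme described above. It is a sketch, not the kernel from neighbor_list.cu: the `-1` sentinel for an empty slot, the block size of 256, the example nloc of 192, and the names `scan_row_serial` and `blocks_launched` are all assumptions chosen for illustration.

```python
def scan_row_serial(row, empty=-1):
    """Work done by ONE thread: walk a candidate row and give each valid
    entry its position in the compacted neighbor list.

    `empty` marks a slot that holds no neighbor (assumed sentinel).
    Returns (order, count); order[i] is None for empty slots.
    """
    order = [None] * len(row)
    count = 0
    for i, entry in enumerate(row):  # strictly sequential, O(len(row))
        if entry != empty:
            order[i] = count
            count += 1
    return order, count


def blocks_launched(nloc, threads_per_block=256):
    """Thread blocks needed when each atom gets exactly one thread."""
    return (nloc + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    row = [-1, 7, -1, 3, 9, -1]          # hypothetical candidate row
    print(scan_row_serial(row))          # ([None, 0, None, 1, 2, None], 3)
    print(blocks_launched(192))          # 1: a single block for the whole step
```

With any nloc at or below the block size, the ceiling division yields a single block, so only one streaming multiprocessor has work while each thread walks its row sequentially.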

In the new implementation, I use the header-only NVIDIA/cub library to parallelize the scan kernel, which speeds up the scan_nlist kernel by more than 50 times. The overall training speedup is only about 3%, mainly because a considerable part of the training process still does not run in GPU kernels. More detailed profiling of the training process is still needed.
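Scanning a neighbor row amounts to a prefix sum over "this slot holds a neighbor" flags, which is the pattern that block-level scan primitives in libraries such as CUB are built to parallelize. The sketch below emulates a data-parallel (Hillis-Steele) inclusive scan on the CPU in plain Python; it does not call CUB and is not the code from this PR. The `-1` sentinel and the function names are assumptions for illustration.

```python
def inclusive_scan_parallel(values):
    """Hillis-Steele inclusive prefix sum.

    Each pass of the while loop models one synchronized step in which every
    "thread" i reads the previous buffer and writes its own slot, so the
    scan finishes in ceil(log2(n)) steps instead of n sequential ones.
    """
    data = list(values)
    offset = 1
    while offset < len(data):
        prev = data[:]  # double buffering: read old values, write new ones
        for i in range(offset, len(data)):  # updates are independent per i
            data[i] = prev[i] + prev[i - offset]
        offset *= 2
    return data


def scan_row_parallel(row, empty=-1):
    """Compacted position of each valid entry, derived from the prefix sum."""
    flags = [0 if entry == empty else 1 for entry in row]
    inclusive = inclusive_scan_parallel(flags)
    # inclusive[i] - 1 is the number of valid entries strictly before slot i
    order = [inc - 1 if flag else None for inc, flag in zip(inclusive, flags)]
    count = inclusive[-1] if inclusive else 0
    return order, count


if __name__ == "__main__":
    row = [-1, 7, -1, 3, 9, -1]          # hypothetical candidate row
    print(scan_row_parallel(row))        # ([None, 0, None, 1, 2, None], 3)
```

Because every update inside one step is independent, many threads can cooperate on a single row, so the amount of parallel work no longer depends on nloc alone.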

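The lcurve.out files from CPU training (lcurve-cpu.out), GPU training (lcurve-gpu.out), and training with the new parallel scan (lcurve-parallel.out) agree to the printed precision, apart from a last-digit difference in rmse_e_val at step 3 (9.81e-02 vs. 9.82e-02):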

root se_e2_a $ head lcurve-cpu.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.65e+01    2.76e+01      6.77e-01    6.79e-01      8.38e-01    8.71e-01    1.0e-03
      1      2.64e+01    2.78e+01      4.32e-01    4.31e-01      8.35e-01    8.78e-01    1.0e-03
      2      2.53e+01    2.52e+01      1.33e-01    1.28e-01      7.99e-01    7.98e-01    1.0e-03
      3      2.44e+01    2.31e+01      9.82e-02    9.90e-02      7.72e-01    7.30e-01    1.0e-03
      4      2.78e+01    2.57e+01      2.39e-01    2.38e-01      8.80e-01    8.12e-01    1.0e-03
      5      2.54e+01    2.54e+01      3.01e-01    3.04e-01      8.04e-01    8.04e-01    1.0e-03
      6      2.58e+01    2.49e+01      2.98e-01    3.03e-01      8.16e-01    7.88e-01    1.0e-03
      7      2.62e+01    2.36e+01      2.37e-01    2.40e-01      8.29e-01    7.46e-01    1.0e-03
      8      2.52e+01    2.58e+01      1.80e-01    1.79e-01      7.97e-01    8.15e-01    1.0e-03

root se_e2_a $ head lcurve-gpu.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.65e+01    2.76e+01      6.77e-01    6.79e-01      8.38e-01    8.71e-01    1.0e-03
      1      2.64e+01    2.78e+01      4.32e-01    4.31e-01      8.35e-01    8.78e-01    1.0e-03
      2      2.53e+01    2.52e+01      1.33e-01    1.28e-01      7.99e-01    7.98e-01    1.0e-03
      3      2.44e+01    2.31e+01      9.82e-02    9.90e-02      7.72e-01    7.30e-01    1.0e-03
      4      2.78e+01    2.57e+01      2.39e-01    2.38e-01      8.80e-01    8.12e-01    1.0e-03
      5      2.54e+01    2.54e+01      3.01e-01    3.04e-01      8.04e-01    8.04e-01    1.0e-03
      6      2.58e+01    2.49e+01      2.98e-01    3.03e-01      8.16e-01    7.88e-01    1.0e-03
      7      2.62e+01    2.36e+01      2.37e-01    2.40e-01      8.29e-01    7.46e-01    1.0e-03
      8      2.52e+01    2.58e+01      1.80e-01    1.79e-01      7.97e-01    8.15e-01    1.0e-03

root se_e2_a $ head lcurve-parallel.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.65e+01    2.76e+01      6.77e-01    6.79e-01      8.38e-01    8.71e-01    1.0e-03
      1      2.64e+01    2.78e+01      4.32e-01    4.31e-01      8.35e-01    8.78e-01    1.0e-03
      2      2.53e+01    2.52e+01      1.33e-01    1.28e-01      7.99e-01    7.98e-01    1.0e-03
      3      2.44e+01    2.31e+01      9.81e-02    9.90e-02      7.72e-01    7.30e-01    1.0e-03
      4      2.78e+01    2.57e+01      2.39e-01    2.38e-01      8.80e-01    8.12e-01    1.0e-03
      5      2.54e+01    2.54e+01      3.01e-01    3.04e-01      8.04e-01    8.04e-01    1.0e-03
      6      2.58e+01    2.49e+01      2.98e-01    3.03e-01      8.16e-01    7.88e-01    1.0e-03
      7      2.62e+01    2.36e+01      2.37e-01    2.40e-01      8.29e-01    7.46e-01    1.0e-03
      8      2.52e+01    2.58e+01      1.80e-01    1.79e-01      7.97e-01    8.15e-01    1.0e-03

This supports the correctness of the new implementation. All unit tests (UTs) also pass on my local V100 workstation.
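A comparison like the one above can be automated with a tolerance check instead of reading the tables by eye. The following is a minimal sketch, assuming the whitespace-separated layout shown in the listings; `parse_lcurve` and `max_relative_diff` are hypothetical helper names, not part of deepmd-kit.

```python
def parse_lcurve(text):
    """Parse lcurve.out-style text into {step: [float, ...]}, skipping '#' lines."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        table[int(fields[0])] = [float(x) for x in fields[1:]]
    return table


def max_relative_diff(a, b):
    """Largest relative difference over the steps and columns both tables share."""
    worst = 0.0
    for step in a.keys() & b.keys():
        for x, y in zip(a[step], b[step]):
            scale = max(abs(x), abs(y))
            if scale > 0.0:
                worst = max(worst, abs(x - y) / scale)
    return worst


if __name__ == "__main__":
    # Step 3 rows copied from the lcurve-gpu.out and lcurve-parallel.out listings.
    gpu = parse_lcurve(
        "#  step  rmse_val  rmse_trn  rmse_e_val  rmse_e_trn  rmse_f_val  rmse_f_trn  lr\n"
        "3  2.44e+01  2.31e+01  9.82e-02  9.90e-02  7.72e-01  7.30e-01  1.0e-03\n"
    )
    parallel = parse_lcurve(
        "3  2.44e+01  2.31e+01  9.81e-02  9.90e-02  7.72e-01  7.30e-01  1.0e-03\n"
    )
    print(max_relative_diff(gpu, parallel) < 1e-2)  # True
```

A relative tolerance is used here rather than an exact text diff because the last printed digit can differ between otherwise equivalent runs, as the step 3 row shows.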

codecov-commenter · 2021-08-25T03:42:22Z

Codecov Report

Merging #1028 (28e16e1) into devel (c0874f0) will decrease coverage by 7.85%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##            devel    #1028      +/-   ##
==========================================
- Coverage   82.86%   75.01%   -7.86%     
==========================================
  Files         119       86      -33     
  Lines       10110     6924    -3186     
==========================================
- Hits         8378     5194    -3184     
+ Misses       1732     1730       -2

Impacted Files	Coverage Δ
deepmd/fit/ener.py	`94.63% <0.00%> (ø)`
source/lib/tests/test_simulation_region.cc
source/lib/tests/test_fmt_nlist.cc
source/api_cc/tests/test_deeppot_model_devi.cc
source/lib/tests/test_tabulate.cc
...ource/lib/tests/test_soft_min_switch_force_grad.cc
source/lib/tests/test_coord.cc
source/lib/tests/test_env_mat_a.cc
source/lib/tests/test_main.cc
source/lib/tests/test_soft_min_switch_virial.cc
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

amcadmus · 2021-08-25T05:33:30Z

@pkulzy could you plz check if this PR works fine on ROCm? Thanks!

galeselee · 2021-08-25T17:19:58Z

OK. These changes have passed the UTs. As for the specific speedup, I need to wait until the cluster environment is in better shape to measure it.

denghuilu added 2 commits August 25, 2021 09:10

speedup cuda kernel scan_nlist

c6bbb71

fix no-pbc error

28e16e1

denghuilu requested review from amcadmus, galeselee and iProzd August 25, 2021 03:38

denghuilu changed the title ~~Speedup scan~~ speedup scan_nlist kernel Aug 25, 2021

galeselee approved these changes Aug 25, 2021

amcadmus approved these changes Aug 25, 2021

amcadmus merged commit 602760e into deepmodeling:devel Aug 25, 2021

gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021

speedup scan_nlist kernel (deepmodeling#1028)

5d028c4

* speedup cuda kernel scan_nlist * fix no-pbc error

amcadmus merged 2 commits into deepmodeling:devel from denghuilu:speedup-scan
