merge devel into multi-task by njzjz · Pull Request #931 · deepmodeling/deepmd-kit

njzjz · 2021-08-08T20:51:51Z

So #929 can be reviewed by us.

* also convert input v1 to v2 for dp compress
* correct the dump file name

* Update prod_env_mat.cu
* Update prod_env_mat.cu

fix #847. See also sphinx-doc/sphinx#9433.

CUDA 11 includes cub, so we no longer need to bundle it. cub can be upgraded along with the CUDA Toolkit to obtain bugfixes. See https://github.com/NVIDIA/cub#releases and https://github.com/NVIDIA/cub/blob/main/CHANGELOG.md. P.S. It is still unclear to me which version of C++ it supports. From the changelog, maybe C++17 is supported in CUDA 11.2? (#755)

#855)

* Synchronize CUDA _r modifications to ROCm
* fix bug 824 and synchronize updates to the CUDA code. Bug 824, caused by an array going out of bounds, is fixed in ROCm.
* Update prod_env_mat.hip.cu
* Add Errcheck after every kernel function runs, and merge redundant code
* Get rid of duplicate definitions of DPErrcheck

Co-authored-by: 李泽宇 <li_zeyu@pku.edu.cn>

* Update neighbor_stat.py: add safe check and warning for atoms with no neighbor
* Update neighbor_stat.py: use log.warning instead of log.warn

* adapt changes to auditwheel directory in manylinux. See pypa/manylinux#1143.
* find auditwheel path via `auditwheel --version`
* use custom image instead
* Update .github/workflows/build_wheel.yml

Update the compiling arguments in c++ interface example.

* build low and high precision at the same time

  We can then provide a single package containing both precisions.

  BREAKING CHANGES:

  * Python: the Python package will build both precisions, and DP_FLOAT_PREC is now a runtime environment variable
  * C++: CMake will build both libraries, which will be called something like libdeepmd and libdeepmd_low
  * LAMMPS: generates two directories, USER-DEEPMD and USER-DEEPMD_low
  * ipi: generates two executables, dp_ipi and dp_ipi_low
* fix LAMMPS build script
* fix lammps cmake file
* install LIB_DEEPMD_OP_VARIANT
* remove FLOAT_PREC argument
* change DP_FLOAT_PREC to DP_INTERFACE_PREC
* revert some libraries as they do not need to build twice
* update error message
* change the implementation of LAMMPS variant: now `env.sh` and `env_low.sh` will be generated in the same directory. Users can easily `mv env_low.sh env.sh` if they need low precision.
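Since the precision is now chosen at run time rather than at build time, the Python side has to read the variable when it starts. The following is only a sketch of that idea: the accepted values `high` and `low`, the default, the dtype names, and the helper `interface_precision` are all assumptions for illustration and are not taken from this PR.

```python
import os

# Hypothetical mapping; the value names "high" and "low" are assumptions.
_PRECISION_TO_DTYPE = {"high": "float64", "low": "float32"}


def interface_precision(environ=None):
    """Return the dtype name selected by DP_INTERFACE_PREC (assumed default: high)."""
    environ = os.environ if environ is None else environ
    value = environ.get("DP_INTERFACE_PREC", "high").lower()
    if value not in _PRECISION_TO_DTYPE:
        raise ValueError(
            f"DP_INTERFACE_PREC must be one of {sorted(_PRECISION_TO_DTYPE)}, "
            f"got {value!r}"
        )
    return _PRECISION_TO_DTYPE[value]
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.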

* Replace PS-Worker mode with multi-worker one.
* Remove deprecated `try_distrib` argument in tests.
* Limit reference of mpi4py to logger.py.
* Add tutorial on parallel training.
* Refine words & tokens used.
* Only limit sub sessions to CPU when distributed training.
* Add description of `mpi4py` in tutorial.
* Explain linear relationship between batch size and learning rate.
* Fine documents & comments.
* Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset.

Co-authored-by: Han Wang <amcadmus@gmail.com>
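The linear relationship between batch size and learning rate mentioned above is commonly applied as a scaling rule: with N workers each processing one batch per step, the effective batch is N times larger, and the learning rate is multiplied by N. The sketch below shows only that arithmetic; the function names are hypothetical, and whether the trainer applies exactly this rule is not stated in this PR.

```python
def effective_batch_size(per_worker_batch: int, n_workers: int) -> int:
    """Total number of samples consumed per step across all workers."""
    return per_worker_batch * n_workers


def scale_learning_rate(base_lr: float, n_workers: int) -> float:
    """Linear scaling rule: learning rate grows in proportion to the worker count."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return base_lr * n_workers
```

For example, a base learning rate of 0.001 with 4 workers would be scaled to 0.004 under this rule.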

…rix (#900)

* fix `InvalidArgumentError` caused by zero `sel`. Fix #899. See comments in the code for details.
* directly return zero matrix for exclude_types
* also optimize for se_r

* enhance the cli to generate doc json file
* bump dargs version; add argument to tests
* correct the type hint of `out_type`

* Find available GPUs in an elegant way.
* Clean codes of preparing parallel context.
* Fix code style and typo.
* Use a subprocess to detect GPU.
* Use Popen as a context manager.
* Do not use `tf.test.built_with_gpu_support`.

Co-authored-by: Han Wang <amcadmus@gmail.com>
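Detecting devices in a child process keeps the parent from initializing any GPU runtime, and using `Popen` as a context manager ensures the pipes are closed and the child is waited on. The sketch below shows the pattern only: `probe_in_subprocess` and `FAKE_PROBE` are hypothetical names, and the probe code is a stand-in, since the actual query used by the project is not shown in this PR.

```python
import subprocess
import sys


def probe_in_subprocess(probe_code: str) -> list:
    """Run `probe_code` in a child Python interpreter and return its stdout lines."""
    with subprocess.Popen(
        [sys.executable, "-c", probe_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    ) as proc:
        # communicate() reads both pipes and waits for the child to exit.
        stdout, stderr = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"probe failed: {stderr.strip()}")
    return [line for line in stdout.splitlines() if line]


# Stand-in probe: a real one would ask the framework for its visible devices.
FAKE_PROBE = "print('GPU:0'); print('GPU:1')"
```

Calling `probe_in_subprocess(FAKE_PROBE)` returns one entry per line printed by the child.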

The default value of `type_map` is `None`, so when you don't set `type_map`, you'll get this error. https://github.com/deepmodeling/deepmd-kit/blob/043ac869bfcdc7f3a20aa24d04bb7c7b88abcc0b/deepmd/entrypoints/train.py#L225

Currently I can't see any warning during the training if sel is not enough, so it's a good idea to check it before training, and tell the user what to do. Also fix #874.

…r parallel training. (#913)

* Add unit tests of `cluster` and `env`.
* Fix the expanding logic of `SLURM_JOB_NODELIST`.

Starting from v2.0.0.b4, `libdeepmd` and `lammps-dp` will be decoupled. The idea is that both the C++ API and LAMMPS are usually stable, so we do not need to build LAMMPS in every release. Also, the CPU and GPU versions share the same API and LAMMPS itself does not need CUDA, so we do not need to build LAMMPS twice.

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

* Passing error to TF instead of exit

  This commit does three little things:
  (1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`);
  (2) throw this exception instead of calling `exit` or throwing `std::runtime_error`;
  (3) catch this exception in the op, and pass it to TF using `OP_REQUIRES_OK`.

  In addition, an OOM error will raise ResourceExhausted, the same as TF ops.

  The benefit of doing so is that the TF side and the Python side can process other things, catch the error, and print the traceback. This commit also fixes #802, where Python didn't save the buffer to the file before exit.
* define try catch function
* replace std::runtime_error
* add headers
* clean useless line
* add custom_op.cc to api_cc tests and rename save_compute to safe_compute

* add lammps compute style for deep tensor
* support the choice of floating point precision
* update doc for deeptensor/atom

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

codecov-commenter · 2021-08-08T20:55:17Z

Codecov Report

Merging #931 (b30a75e) into multi-task (cd3c920) will increase coverage by 1.53%.
The diff coverage is 64.02%.

@@              Coverage Diff               @@
##           multi-task     #931      +/-   ##
==============================================
+ Coverage       73.88%   75.41%   +1.53%     
==============================================
  Files              85       85              
  Lines            6805     6729      -76     
==============================================
+ Hits             5028     5075      +47     
+ Misses           1777     1654     -123

| Impacted Files | Coverage Δ | |
|---|---|---|
| deepmd/entrypoints/__init__.py | `100.00% <ø> (ø)` | |
| deepmd/utils/compat.py | `83.06% <ø> (-0.27%)` | ⬇️ |
| source/lib/include/neighbor_list.h | `100.00% <ø> (ø)` | |
| deepmd/entrypoints/doc.py | `33.33% <28.57%> (-26.67%)` | ⬇️ |
| deepmd/descriptor/se_r.py | `92.85% <33.33%> (ø)` | |
| deepmd/loggers/loggers.py | `41.90% <33.33%> (-0.67%)` | ⬇️ |
| deepmd/entrypoints/compress.py | `92.50% <50.00%> (ø)` | |
| deepmd/utils/neighbor_stat.py | `93.75% <50.00%> (-6.25%)` | ⬇️ |
| deepmd/train/trainer.py | `71.06% <56.89%> (+3.18%)` | ⬆️ |
| deepmd/train/run_options.py | `72.81% <60.00%> (+20.46%)` | ⬆️ |
| ... and 11 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cd3c920...b30a75e.

njzjz and others added 21 commits July 13, 2021 13:49

also convert input v1 to v2 for dp compress (#844)

94635ba

* also convert input v1 to v2 for dp compress
* correct the dump file name

Synchronize format_nlist_b in CUDA with ROCm (#845)

78123a5

* Update prod_env_mat.cu
* Update prod_env_mat.cu

pin sphinx to a previous version (#848)

1f01650

fix #847. See also sphinx-doc/sphinx#9433.

Fix the empty neighbor distance array in neighbor_stat.py (#882)

851fa96

* Update neighbor_stat.py: add safe check and warning for atoms with no neighbor
* Update neighbor_stat.py: use log.warning instead of log.warn
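The kind of guard described in this commit can be sketched as below. This is not the code from `neighbor_stat.py`: the function name, the `default` value, and the message text are hypothetical. It only illustrates checking for an empty distance list before taking a minimum, and logging through `log.warning` rather than the deprecated `log.warn` alias.

```python
import logging

log = logging.getLogger(__name__)


def min_neighbor_distance(distances, default=float("inf")):
    """Return the smallest neighbor distance, or `default` if there are none."""
    distances = list(distances)
    if not distances:
        # min() on an empty sequence would raise ValueError, so warn and fall back.
        log.warning("Atom has no neighbor within the cutoff; using %s", default)
        return default
    return min(distances)
```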

adapt changes to auditwheel directory in manylinux (#889)

c4b9c9e

* adapt changes to auditwheel directory in manylinux. See pypa/manylinux#1143.
* find auditwheel path via `auditwheel --version`
* use custom image instead
* Update .github/workflows/build_wheel.yml

Update getting-started.md (#898)

953621f

Update the compiling arguments in c++ interface example.

fix InvalidArgumentError caused by zero sel and optimize zero mat…

70508a5

…rix (#900)

* fix `InvalidArgumentError` caused by zero `sel`. Fix #899. See comments in the code for details.
* directly return zero matrix for exclude_types
* also optimize for se_r

enhance the cli to generate doc json file (#891)

043ac86

* enhance the cli to generate doc json file
* bump dargs version; add argument to tests
* correct the type hint of `out_type`

fix 'NoneType' has no len() in auto_sel (#911)

b5b15fa

The default value of `type_map` is `None`, so when you don't set `type_map`, you'll get this error. https://github.com/deepmodeling/deepmd-kit/blob/043ac869bfcdc7f3a20aa24d04bb7c7b88abcc0b/deepmd/entrypoints/train.py#L225

raise warning before training if sel is not enough (#914)

689ffa4

Currently I can't see any warning during the training if sel is not enough, so it's a good idea to check it before training, and tell the user what to do. Also fix #874.
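A pre-training check of this kind could be sketched as follows. The function name `check_sel`, the per-type list inputs, and the message text are hypothetical and are not taken from the PR; the sketch only shows comparing the observed maximum neighbor count for each type against the configured `sel` and warning when `sel` is too small.

```python
import logging

log = logging.getLogger(__name__)


def check_sel(max_neighbors, sel):
    """Return the indices of atom types whose `sel` is below the observed maximum."""
    if len(max_neighbors) != len(sel):
        raise ValueError("max_neighbors and sel must have the same length")
    too_small = [i for i, (n, s) in enumerate(zip(max_neighbors, sel)) if n > s]
    for i in too_small:
        log.warning(
            "sel[%d]=%d is smaller than the observed maximum of %d neighbors; "
            "consider increasing it",
            i, sel[i], max_neighbors[i],
        )
    return too_small
```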

Fix the expanding logic of SLURM_JOB_NODELIST and add unit tests fo…

ee0ed99

…r parallel training. (#913)

* Add unit tests of `cluster` and `env`.
* Fix the expanding logic of `SLURM_JOB_NODELIST`.
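For context, `SLURM_JOB_NODELIST` holds a compressed host list such as `node[01-03,07],gpu5`. The sketch below is one way to expand the simple form with a single bracket group per name; it is not the project's implementation, `expand_nodelist` is a hypothetical name, and the full SLURM syntax has more cases than are handled here (`scontrol show hostnames` is the authoritative expander).

```python
import re


def expand_nodelist(nodelist: str) -> list:
    """Expand e.g. 'node[01-03,07],gpu5' into individual host names."""
    hosts = []
    # Each match is a name optionally followed by one [...] group, so commas
    # inside brackets do not split the entry.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", nodelist):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", part)
        if match is None:
            hosts.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # keep zero padding, e.g. 01, 02
                for number in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{number:0{width}d}")
            else:
                hosts.append(prefix + item)
    return hosts
```

Zero padding is preserved by reusing the width of the range start, so `node[01-03]` yields `node01`, `node02`, `node03`.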

Fix member declaration of deepmd and deepmd.entrypoints. (#922)

3ae80b3

set input DeepmdData.type_map to input type_map (#924)

4ced020

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

add lammps compute style for atomic deep tensor (#927)

b30a75e

* add lammps compute style for deep tensor
* support the choice of floating point precision
* update doc for deeptensor/atom

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

njzjz changed the title ~~merge devel to multi-task~~ merge devel into multi-task Aug 8, 2021

njzjz merged commit 67c6cc0 into multi-task Aug 8, 2021

njzjz merged 21 commits into multi-task from devel