merge devel into multi-task#931

Merged
njzjz merged 21 commits into multi-task from devel
Aug 8, 2021

Member

@njzjz njzjz commented Aug 8, 2021

So #929 can be reviewed by us.

njzjz and others added 21 commits July 13, 2021 13:49
* also convert input v1 to v2 for dp compress

* correct the dump file name
* Update prod_env_mat.cu

* Update prod_env_mat.cu
CUDA 11 has included cub, so we don't need to include it any more.
cub can be upgraded along with the CUDA Toolkit to obtain bugfixes.
See https://github.com/NVIDIA/cub#releases.
See also https://github.com/NVIDIA/cub/blob/main/CHANGELOG.md.

P.S. It's still unclear to me which version of C++ it supports.
From the changelog, maybe C++17 is supported in CUDA 11.2? (#755)
#855)

* Synchronize CUDA _r modifications to ROCM

* fix bug 824 and Synchronize updates to CUDA code

Bug 824 is fixed in ROCM; it was caused by an array going out of bounds.

* Update prod_env_mat.hip.cu

* Add Errcheck after every kernel function runs and merge redundant code

* Get rid of duplicate definitions of DPErrcheck

Co-authored-by: 李泽宇 <li_zeyu@pku.edu.cn>
* Update neighbor_stat.py

add safe check and warning for atoms with no neighbor

* Update neighbor_stat.py

use log.warning instead of log.warn
* adapt changes to auditwheel directory in manylinux

See pypa/manylinux#1143.

* find auditwheel path via `auditwheel --version`

* use custom image instead

* Update .github/workflows/build_wheel.yml
Update the compiling arguments in c++ interface example.
* build low and high precision at the same time

We can provide a single package containing both precisions.
BREAKING CHANGES:
Python: the Python package will build both precisions, and DP_FLOAT_PREC
is now a runtime environment variable
C++: CMake will build both libraries, which will be called something like
libdeepmd and libdeepmd_low
LAMMPS: generates two directories, USER-DEEPMD and USER-DEEPMD_low
ipi: generates two executables, dp_ipi and dp_ipi_low

* fix LAMMPS build script

* fix lammps cmake file

* install LIB_DEEPMD_OP_VARIANT

* remove FLOAT_PREC argument

* change DP_FLOAT_PREC to DP_INTERFACE_PREC

* revert some libraries as they do not need to build twice

* update error message

* change the implementation of LAMMPS variant

now `env.sh` and `env_low.sh` will be generated in the same directory.
Users can easily `mv env_low.sh env.sh` if they need low precision.
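
As a rough illustration of selecting the precision at run time rather than at build time, the sketch below reads `DP_INTERFACE_PREC` from the environment. The accepted values `high` and `low`, the default, the dtype names, and the helper name `interface_precision` are all assumptions made for illustration; they are not taken from this pull request.

```python
import os

# Assumed mapping: "high"/"low" mirror the libdeepmd / libdeepmd_low naming
# above; the dtype names are illustrative, not confirmed by the source.
_PRECISIONS = {"high": "float64", "low": "float32"}


def interface_precision(environ=None):
    """Return the dtype name selected by DP_INTERFACE_PREC (assumed default: high)."""
    environ = os.environ if environ is None else environ
    value = environ.get("DP_INTERFACE_PREC", "high").lower()
    if value not in _PRECISIONS:
        raise RuntimeError(f"Unsupported DP_INTERFACE_PREC value: {value!r}")
    return _PRECISIONS[value]
```

Because the variable is read when the function is called, the same installed package can serve both precisions without a rebuild.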
* Replace PS-Worker mode with multi-worker one.

* Remove deprecated `try_distrib` argument in tests.

* Limit reference of mpi4py to logger.py.

* Add tutorial on parallel training.

* Refine words & tokens used.

* Only limit sub sessions to CPU when distributed training.

* Add description of `mpi4py` in tutorial.

* Explain linear relationship between batch size and learning rate.

* Refine documents & comments.

* Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset.

Co-authored-by: Han Wang <amcadmus@gmail.com>
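
The linear relationship between batch size and learning rate mentioned in the commits above is often applied as follows: with N workers each processing its own batch, the effective batch size grows by a factor of N, and the learning rate is scaled by the same factor. The sketch below shows that general rule only; the function names are hypothetical and the tutorial added in this pull request may word or implement it differently.

```python
def effective_batch_size(per_worker_batch_size, num_workers):
    """Total number of samples consumed per step across all workers."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return per_worker_batch_size * num_workers


def scaled_learning_rate(base_lr, num_workers):
    """Scale the single-worker learning rate linearly with the worker count."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return base_lr * num_workers
```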
…rix (#900)

* fix `InvalidArgumentError` caused by zero `sel`

Fix #899. See comments in the code for details.

* directly return zero matrix for exclude_types

* also optimize for se_r
* enhance the cli to generate doc json file

* bump dargs version; add argument to tests

* correct the type hint of `out_type`
* Find available GPUs in an elegant way.

* Clean codes of preparing parallel context.

* Fix code style and typo.

* Use a subprocess to detect GPU.

* Use Popen as a context manager.

* Do not use `tf.test.built_with_gpu_support`.

Co-authored-by: Han Wang <amcadmus@gmail.com>
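
A minimal sketch of the "subprocess plus `Popen` as a context manager" idea from the commits above: the probe runs in a fresh interpreter, so the calling process never initializes a GPU context itself. The helper name `run_python_snippet` and the `GPU_PROBE` string are hypothetical, and the probe assumes a TensorFlow version that provides `tf.config.list_physical_devices`; the code in this pull request may detect GPUs differently.

```python
import subprocess
import sys


def run_python_snippet(code):
    """Run `code` in a fresh Python interpreter and return its stripped stdout."""
    with subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    ) as proc:
        stdout, stderr = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"Probe failed: {stderr.strip()}")
    return stdout.strip()


# Hypothetical probe (not executed here): count the GPUs visible to TensorFlow.
GPU_PROBE = (
    "import tensorflow as tf;"
    "print(len(tf.config.list_physical_devices('GPU')))"
)
```

Using `Popen` in a `with` block closes the pipes and waits for the child process even if an exception is raised while reading its output.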
Currently no warning is shown during training if sel is too small,
so it's a good idea to check it before training and tell the user what to do.

Also fix #874.
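
One way such a pre-training check could look is sketched below. It assumes `sel` and the observed maximum neighbor counts are both given as per-type lists of integers; the function name `check_sel` and the wording of the warning are hypothetical and not taken from the actual change.

```python
import logging

log = logging.getLogger(__name__)


def check_sel(sel, max_neighbors):
    """Return the indices of atom types whose `sel` is below the observed maximum."""
    if len(sel) != len(max_neighbors):
        raise ValueError("sel and max_neighbors must have the same length")
    too_small = [i for i, (s, m) in enumerate(zip(sel, max_neighbors)) if s < m]
    for i in too_small:
        # Logger.warning is used here; Logger.warn is a deprecated alias.
        log.warning(
            "sel[%d] = %d is smaller than the observed maximum of %d neighbors; "
            "consider increasing it",
            i, sel[i], max_neighbors[i],
        )
    return too_small
```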
…r parallel training. (#913)

* Add unit tests of `cluster` and `env`.

* Fix the expanding logic of `SLURM_JOB_NODELIST`.
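
For readers unfamiliar with the compressed host lists Slurm places in `SLURM_JOB_NODELIST`, the sketch below expands a value such as `node[01-03,07],gpu1` into individual host names. It is an independent illustration, not the logic from this pull request, and it only handles a single bracket group per entry; the function names are hypothetical.

```python
import re


def _split_top_level(nodelist):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ""
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_nodelist(nodelist):
    """Expand e.g. 'node[01-03,07],gpu1' into a list of host names."""
    hosts = []
    for part in _split_top_level(nodelist):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain host name, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # preserve zero padding such as 01, 02
                for number in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{number:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts
```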
Starting from v2.0.0.b4, `libdeepmd` and `lammps-dp` will be decoupled.
The idea is that both the C++ API and LAMMPS are usually stable, so we do not need to build LAMMPS in every release. Also, the CPU and GPU versions share the same API, and LAMMPS itself does not need CUDA, so we do not need to build LAMMPS twice.
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
* Passing error to TF instead of exit

This commit does three little things:
(1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`);
(2) throw this exception instead of calling `exit` or throwing `std::runtime_error`;
(3) catch this exception in the op, and pass it to TF using `OP_REQUIRES_OK`.
In addition, the OOM error will raise ResourceExhausted, the same as TF ops.

The benefit of doing so is that the TF side and the Python side can process other things, catch the error, and print the traceback.
This commit can also fix #802, where Python didn't save the buffer to the file before exiting.

* define try catch function

* replace std::runtime_error

* add headers

* clean useless line

* add custom_op.cc to api_cc tests and rename save_compute to safe_compute
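
The benefit on the Python side can be illustrated with a small analog, sketched below. It is not the C++ implementation from this pull request: `OpError`, `failing_op`, and `train` are hypothetical stand-ins. The point is only that when an op raises an exception instead of terminating the process, the caller still gets a chance to write buffered output before the error propagates.

```python
import io


class OpError(RuntimeError):
    """Hypothetical stand-in for an error surfaced by a custom op."""


def failing_op(step):
    # Raise instead of terminating the process, so callers can clean up.
    if step == 2:
        raise OpError("illegal input at step 2")
    return step * 0.5


def train(num_steps, log_file):
    """Run the steps, always writing buffered log lines to `log_file`."""
    buffer = []
    try:
        for step in range(num_steps):
            buffer.append(f"{step} {failing_op(step)}\n")
    finally:
        # Runs even when the op raises, which a hard process exit would skip.
        log_file.writelines(buffer)
        log_file.flush()


if __name__ == "__main__":
    out = io.StringIO()
    try:
        train(5, out)
    except OpError as err:
        print("caught:", err)
    print(out.getvalue(), end="")
```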
* add lammps compute style for deep tensor

* support the choice of floating point precision

* update doc for deeptensor/atom

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

codecov-commenter commented Aug 8, 2021

Codecov Report

Merging #931 (b30a75e) into multi-task (cd3c920) will increase coverage by 1.53%.
The diff coverage is 64.02%.

@@              Coverage Diff               @@
##           multi-task     #931      +/-   ##
==============================================
+ Coverage       73.88%   75.41%   +1.53%     
==============================================
  Files              85       85              
  Lines            6805     6729      -76     
==============================================
+ Hits             5028     5075      +47     
+ Misses           1777     1654     -123     
Impacted Files Coverage Δ
deepmd/entrypoints/__init__.py 100.00% <ø> (ø)
deepmd/utils/compat.py 83.06% <ø> (-0.27%) ⬇️
source/lib/include/neighbor_list.h 100.00% <ø> (ø)
deepmd/entrypoints/doc.py 33.33% <28.57%> (-26.67%) ⬇️
deepmd/descriptor/se_r.py 92.85% <33.33%> (ø)
deepmd/loggers/loggers.py 41.90% <33.33%> (-0.67%) ⬇️
deepmd/entrypoints/compress.py 92.50% <50.00%> (ø)
deepmd/utils/neighbor_stat.py 93.75% <50.00%> (-6.25%) ⬇️
deepmd/train/trainer.py 71.06% <56.89%> (+3.18%) ⬆️
deepmd/train/run_options.py 72.81% <60.00%> (+20.46%) ⬆️
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cd3c920...b30a75e.

@njzjz njzjz changed the title from "merge devel to multi-task" to "merge devel into multi-task" Aug 8, 2021
@njzjz njzjz merged commit 67c6cc0 into multi-task Aug 8, 2021

7 participants