Devel update #17

Merged
iProzd merged 18 commits into iProzd:devel from deepmodeling:devel on Aug 9, 2021
Conversation


iProzd commented Aug 9, 2021

No description provided.

njzjz and others added 18 commits July 28, 2021 20:36
* adapt changes to auditwheel directory in manylinux

See pypa/manylinux#1143.

* find auditwheel path via `auditwheel --version`

* use custom image instead

* Update .github/workflows/build_wheel.yml

* Update the compiling arguments in the C++ interface example.
* build low and high precision at the same time

We can only provide one package containing both precisions.
BREAKING CHANGES:
Python: the Python package will build both precisions, and DP_FLOAT_PREC
is now a runtime environment variable
C++: CMake will build both libraries, named like
libdeepmd and libdeepmd_low
LAMMPS: generates two directories, USER-DEEPMD and USER-DEEPMD_low
i-PI: generates two executables, dp_ipi and dp_ipi_low
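The move from a compile-time precision flag to a runtime environment variable can be sketched as follows. This is an illustrative assumption, not the package's actual API: `interface_precision` and the `"high"`/`"low"` → float64/float32 mapping are hypothetical names chosen for the example.

```python
import os

def interface_precision(env_var="DP_FLOAT_PREC"):
    """Hypothetical sketch: pick the interface precision at runtime from
    an environment variable instead of fixing it at build time.
    The mapping "high" -> float64, "low" -> float32 is assumed here
    purely for illustration."""
    value = os.environ.get(env_var, "high").lower()
    if value == "high":
        return "float64"
    if value == "low":
        return "float32"
    raise ValueError(f"unsupported precision: {value!r}")
```

A later commit in this PR renames the variable, so the default `env_var` would change accordingly.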

* fix LAMMPS build script

* fix lammps cmake file

* install LIB_DEEPMD_OP_VARIANT

* remove FLOAT_PREC argument

* change DP_FLOAT_PREC to DP_INTERFACE_PREC

* revert some libraries as they do not need to build twice

* update error message

* change the implementation of LAMMPS variant

now `env.sh` and `env_low.sh` will be generated in the same directory.
Users can easily `mv env_low.sh env.sh` if they need low precision.
* Replace PS-Worker mode with multi-worker one.

* Remove deprecated `try_distrib` argument in tests.

* Limit reference of mpi4py to logger.py.

* Add tutorial on parallel training.

* Refine words & tokens used.

* Only limit sub sessions to CPU when distributed training.

* Add description of `mpi4py` in tutorial.

* Explain linear relationship between batch size and learning rate.
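The linear relationship mentioned above is the usual data-parallel scaling rule: with N workers the effective batch size grows by N, so the learning rate is scaled by the same factor. A minimal sketch (function name hypothetical):

```python
def scale_learning_rate(base_lr, n_workers):
    """Linear scaling rule (sketch): data-parallel training multiplies the
    effective batch size by n_workers, so scale the learning rate by the
    same factor to keep the per-sample update magnitude comparable."""
    return base_lr * n_workers
```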

* Fine documents & comments.

* Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset.

Co-authored-by: Han Wang <amcadmus@gmail.com>
…rix (#900)

* fix `InvalidArgumentError` caused by zero `sel`

Fix #899. See comments in the code for details.

* directly return zero matrix for exclude_types

* also optimize for se_r
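The shortcut described in this commit group can be sketched in NumPy: when `sel` is zero (or every neighbor type is excluded), the descriptor contribution is identically zero, so a zero matrix is returned directly rather than running ops that raise `InvalidArgumentError` on empty inputs. All names here are illustrative, not the real kernel signatures.

```python
import numpy as np

def descriptor_block(nframes, natoms, sel, compute_fn):
    """Hypothetical sketch: if sel (selected neighbors for this type) is
    zero, return a zero matrix directly instead of invoking ops that
    fail on empty inputs."""
    ndescrpt = sel * 4  # se_a-style layout: 4 values per neighbor (assumed)
    if sel == 0:
        return np.zeros((nframes, natoms * ndescrpt))
    return compute_fn(nframes, natoms, sel)
```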
* enhance the cli to generate doc json file

* bump dargs version; add argument to tests

* correct the type hint of `out_type`
* Find available GPUs in an elegant way.

* Clean codes of preparing parallel context.

* Fix code style and typo.

* Use a subprocess to detect GPU.

* Use Popen as a context manager.

* Do not use `tf.test.built_with_gpu_support`.

Co-authored-by: Han Wang <amcadmus@gmail.com>
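The subprocess-based detection above keeps CUDA initialization out of the parent process. A rough sketch of the idea, using `nvidia-smi -L` as a stand-in probe (the real implementation may probe differently; this command and the fallback behavior are assumptions):

```python
import subprocess

def count_gpus():
    """Sketch: detect available GPUs in a subprocess so the parent
    process does not initialize CUDA itself. Uses `nvidia-smi -L`
    (assumed available on GPU machines); returns 0 if the tool is
    absent or exits with an error."""
    try:
        with subprocess.Popen(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        ) as proc:  # Popen as a context manager, as in the commit above
            out, _ = proc.communicate()
            if proc.returncode != 0:
                return 0
            return sum(1 for line in out.decode().splitlines() if line.strip())
    except OSError:
        return 0
```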
Currently there is no warning during training when sel is not large enough,
so it is a good idea to check it before training and tell the user what to do.

Also fixes #874.
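The pre-training check described above can be sketched like this: count the largest number of neighbors any atom has within the cutoff and compare it to `sel`. The function name and open-boundary distance computation are simplifications for illustration (real code must also handle periodic boundary conditions).

```python
import numpy as np

def check_sel(coords, rcut, sel):
    """Sketch of a pre-training sanity check (names hypothetical): if
    sel is smaller than the largest neighbor count within rcut,
    neighbors would be silently dropped, so tell the user to enlarge
    sel before training."""
    coords = np.asarray(coords, dtype=float)
    # pairwise distances, open boundary (real code also handles PBC)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # ignore self-distances
    max_nnei = int((dist < rcut).sum(axis=1).max())
    if max_nnei > sel:
        raise ValueError(
            f"sel={sel} is smaller than the max neighbor count {max_nnei}; "
            f"increase sel to at least {max_nnei}"
        )
    return max_nnei
```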
…r parallel training. (#913)

* Add unit tests of `cluster` and `env`.

* Fix the expanding logic of `SLURM_JOB_NODELIST`.
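Expanding `SLURM_JOB_NODELIST` means turning a compressed hostlist like `node[1-3,5],master` into individual host names. A minimal sketch of that logic (one bracket level with comma-separated ranges; real SLURM lists can be more complex, and this is not the project's actual parser):

```python
import re

def expand_nodelist(nodelist):
    """Sketch: expand a SLURM-style node list such as "node[1-3,5],master"
    into individual host names. Handles one level of brackets with
    comma-separated items and zero-padded ranges."""
    hosts = []
    # match either "prefix[...]" groups or bare names between commas
    for part in re.findall(r"[^,\[\]]+(?:\[[^\]]*\])?", nodelist):
        m = re.match(r"^(.*)\[([^\]]*)\]$", part)
        if not m:
            hosts.append(part)
            continue
        prefix, body = m.groups()
        for item in body.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                width = len(lo)  # preserve zero padding, e.g. nd[08-10]
                for i in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{str(i).zfill(width)}")
            else:
                hosts.append(prefix + item)
    return hosts
```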
Starting from v2.0.0.b4, `libdeepmd` and `lammps-dp` will be decoupled.
The idea is that both the C++ API and LAMMPS are usually stable, so we do not need to build LAMMPS in every release. Also, the CPU and GPU versions share the same API, and LAMMPS itself does not need CUDA, so we do not need to build LAMMPS twice.
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
* Passing error to TF instead of exit

This commit does three little things:
(1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`);
(2) throw this exception instead of calling `exit` or throwing `std::runtime_error`;
(3) catch this exception in the op and pass it to TF using `OP_REQUIRES_OK`.
Additionally, an OOM error will raise ResourceExhausted, the same as native TF ops.

The benefit of doing so is that the TF and Python sides can continue running, catch the error, and print the traceback.
This commit also fixes #802, where Python did not save the buffer to the file before exiting.

* define try catch function

* replace std::runtime_error

* add headers

* clean useless line

* add custom_op.cc to api_cc tests and rename save_compute to safe_compute
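The actual change above is C++ (the exception is caught at the op boundary and reported through `OP_REQUIRES_OK`), but the pattern can be sketched in Python. Everything here (`DeepmdException`, `safe_compute`, the status triple) is an analogy, not the real interface:

```python
class DeepmdException(RuntimeError):
    """Stand-in for deepmd::deepmd_exception in this sketch."""

def safe_compute(compute):
    """Sketch of the try/catch wrapper pattern: instead of calling
    exit() inside the kernel, convert exceptions into an error status
    the framework can propagate (the real C++ code reports through
    OP_REQUIRES_OK on the TF OpKernelContext), so callers can catch
    the error and print a traceback."""
    def wrapper(*args, **kwargs):
        try:
            return True, compute(*args, **kwargs), None
        except DeepmdException as e:
            return False, None, str(e)  # error status instead of aborting
    return wrapper
```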
* add lammps compute style for deep tensor

* support the choice of floating point precision

* update doc for deeptensor/atom

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
* add aliases to Arguments

Fix #846.

move for cherry-pick

add aliases to Arguments (#932)

Fix #846.

move back

* fix typo
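Argument aliasing as in the commit above generally means accepting an alternative spelling of a key and normalizing it to the canonical one. A generic sketch of that idea, independent of the dargs library (function and key names hypothetical):

```python
def resolve_aliases(config, canonical, aliases):
    """Hypothetical sketch of argument aliasing: if a config dict uses
    an alias key, rewrite it to the canonical name so downstream code
    only sees one spelling. Raises if both spellings are supplied."""
    config = dict(config)  # do not mutate the caller's dict
    for alias in aliases:
        if alias in config:
            if canonical in config:
                raise ValueError(f"both {canonical!r} and {alias!r} given")
            config[canonical] = config.pop(alias)
    return config
```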
* remove dependences on training script and data from model compression

* reset function update_one_sel in train.py

* update the doc of model compression

* fix bug in UT

* optimize code for reviewer's comments

* undo changes to constant variables

* Update common.py

* update code structure of DPTrainer

* fix lint warnings in common.py

* fix duplicated lines within trainer.py

* Update trainer.py

* rm default values with False optional in argcheck.py
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
@iProzd iProzd merged commit 7072344 into iProzd:devel Aug 9, 2021
iProzd pushed a commit that referenced this pull request Sep 18, 2021