* adapt changes to auditwheel directory in manylinux

  See pypa/manylinux#1143.
* find auditwheel path via `auditwheel --version`
* use custom image instead
* Update .github/workflows/build_wheel.yml
Update the compiling arguments in c++ interface example.
* build low and high precision at the same time

  We can provide just one package containing both precisions.

  BREAKING CHANGES:
  - Python: the Python package builds both precisions, and DP_FLOAT_PREC is now a runtime environment variable
  - C++: CMake builds both libraries, which will be called something like libdeepmd and libdeepmd_low
  - LAMMPS: generates two directories, USER-DEEPMD and USER-DEEPMD_low
  - ipi: generates two executables, dp_ipi and dp_ipi_low
* fix LAMMPS build script
* fix lammps cmake file
* install LIB_DEEPMD_OP_VARIANT
* remove FLOAT_PREC argument
* change DP_FLOAT_PREC to DP_INTERFACE_PREC
* revert some libraries as they do not need to build twice
* update error message
* change the implementation of LAMMPS variant

  Now `env.sh` and `env_low.sh` are generated in the same directory. Users can simply `mv env_low.sh env.sh` if they need low precision.
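The runtime switch described above could look roughly like the sketch below. It is only an illustration: the accepted values `high` and `low`, the default of `high`, and the helper name `select_library_name` are assumptions made for this example. Only the variable name `DP_INTERFACE_PREC` and the `libdeepmd` / `libdeepmd_low` naming come from the commit message.

```python
import os
from typing import Mapping

# Assumed mapping from precision keyword to library name; the commit message
# only says the libraries are called "something like" these names.
_LIBRARY_BY_PRECISION = {
    "high": "libdeepmd",
    "low": "libdeepmd_low",
}


def select_library_name(environ: Mapping[str, str] = os.environ) -> str:
    """Pick a library variant from DP_INTERFACE_PREC at run time.

    The accepted values ("high"/"low") and the default ("high") are
    assumptions for illustration, not taken from the commit.
    """
    precision = environ.get("DP_INTERFACE_PREC", "high").lower()
    try:
        return _LIBRARY_BY_PRECISION[precision]
    except KeyError:
        raise ValueError(
            f"unsupported DP_INTERFACE_PREC value: {precision!r}"
        ) from None
```

Because the variable is read at run time rather than at build time, one installed package can serve both precisions, which matches the intent stated in the commit.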
* Replace PS-Worker mode with multi-worker one.
* Remove deprecated `try_distrib` argument in tests.
* Limit reference of mpi4py to logger.py.
* Add tutorial on parallel training.
* Refine words & tokens used.
* Only limit sub sessions to CPU when distributed training.
* Add description of `mpi4py` in tutorial.
* Explain linear relationship between batch size and learning rate.
* Fine documents & comments.
* Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset.

Co-authored-by: Han Wang <amcadmus@gmail.com>
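The linear relationship between batch size and learning rate mentioned above can be illustrated with a small sketch. It assumes each worker processes its own batch per step, so the global batch size grows with the number of workers; the function names are hypothetical and this is not the code added by the commit.

```python
def global_batch_size(per_worker_batch: int, num_workers: int) -> int:
    """Effective batch size when every worker consumes its own batch per step."""
    if per_worker_batch < 1 or num_workers < 1:
        raise ValueError("batch size and worker count must be positive")
    return per_worker_batch * num_workers


def scaled_learning_rate(base_lr: float, num_workers: int) -> float:
    """Scale the single-worker learning rate linearly with the worker count."""
    if num_workers < 1:
        raise ValueError("worker count must be positive")
    return base_lr * num_workers
```

For example, moving from one worker to four multiplies both the global batch size and the suggested learning rate by four under this rule.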
* enhance the cli to generate doc json file
* bump dargs version; add argument to tests
* correct the type hint of `out_type`
* Find available GPUs in an elegant way.
* Clean codes of preparing parallel context.
* Fix code style and typo.
* Use a subprocess to detect GPU.
* Use Popen as a context manager.
* Do not use `tf.test.built_with_gpu_support`.

Co-authored-by: Han Wang <amcadmus@gmail.com>
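A minimal sketch of the "subprocess plus `Popen` as a context manager" pattern named in these commits follows. The helper `probe_in_subprocess` is hypothetical and not the function added to the repository; it just runs a Python snippet in a child process and returns the non-empty lines the snippet prints. One plausible reason for probing in a child process is to keep the parent from initializing the GPU runtime, but the commit message does not state its motivation.

```python
import subprocess
import sys
from typing import List


def probe_in_subprocess(snippet: str) -> List[str]:
    """Run a Python snippet in a child process and return its stdout lines."""
    # Using Popen as a context manager closes the pipes and waits for the
    # child process when the block exits, even if an exception is raised.
    with subprocess.Popen(
        [sys.executable, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    ) as proc:
        stdout, stderr = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"probe failed: {stderr.strip()}")
    return [line for line in stdout.splitlines() if line.strip()]


# Example snippet (an assumption, requires TensorFlow 2.x in the child process):
GPU_SNIPPET = (
    "import tensorflow as tf\n"
    "for dev in tf.config.list_physical_devices('GPU'):\n"
    "    print(dev.name)\n"
)
```

Calling `probe_in_subprocess(GPU_SNIPPET)` would then return one entry per visible GPU, or an empty list when none is found.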
The default value of `type_map` is `None`, so when you don't set `type_map`, you'll get this error. https://github.com/deepmodeling/deepmd-kit/blob/043ac869bfcdc7f3a20aa24d04bb7c7b88abcc0b/deepmd/entrypoints/train.py#L225
Currently I can't see any warning during the training if sel is not enough, so it's a good idea to check it before training, and tell the user what to do. Also fix #874.
…r parallel training. (#913)
* Add unit tests of `cluster` and `env`.
* Fix the expanding logic of `SLURM_JOB_NODELIST`.
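To show what expanding `SLURM_JOB_NODELIST` involves, here is a simplified sketch. It is not the repository's implementation: the function names are hypothetical, and it only handles host names with at most one bracket group (such as `gpu[01-03,07]`), which covers common cases but not every form Slurm can emit.

```python
import re
from typing import List


def _split_top_level(nodelist: str) -> List[str]:
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_nodelist(nodelist: str) -> List[str]:
    """Expand a compressed node list into individual host names."""
    hosts = []
    for part in _split_top_level(nodelist):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain host name without a bracket group
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # preserve zero padding such as "01"
                for i in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{i:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts
```

The comma inside the brackets is the usual pitfall: a naive `split(",")` on the whole string would cut `gpu[01-03,07]` in half, which is why the sketch splits only at bracket depth zero.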
Starting from v2.0.0.b4, `libdeepmd` and `lammps-dp` will be decoupled. The idea is that both the C++ API and LAMMPS are usually stable, so we do not need to build LAMMPS in every release. Also, the CPU and GPU versions share the same API and LAMMPS itself does not need CUDA, so we do not need to build LAMMPS twice.
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
* Passing error to TF instead of exit

  This commit does three small things:
  (1) creates an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`);
  (2) throws this exception instead of calling `exit` or throwing `std::runtime_error`;
  (3) catches this exception in the op and passes it to TF using `OP_REQUIRES_OK`.

  In addition, an OOM error will raise ResourceExhausted, the same as TF ops do.

  The benefit is that the TF side and the Python side can process other things, catch the error, and print the traceback. This commit also fixes #802, where Python didn't save the buffer to the file before exit.
* define try catch function
* replace std::runtime_error
* add headers
* clean useless line
* add custom_op.cc to api_cc tests and rename save_compute to safe_compute
* add lammps compute style for deep tensor
* support the choice of floating point precision
* update doc for deeptensor/atom

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
* remove dependencies on training script and data from model compression
* reset function update_one_sel in train.py
* update the doc of model compression
* fix bug in UT
* optimize code for reviewer's comments
* undo changes to constant variables
* Update common.py
* update code structure of DPTrainer
* fix lint warnings in common.py
* fix duplicated lines within trainer.py
* Update trainer.py
* rm default values with False optional in argcheck.py
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>