Skip to content

Passing error to TF instead of exit#918

Merged
amcadmus merged 6 commits intodeepmodeling:develfrom
njzjz:handle_error
Aug 6, 2021
Merged

Passing error to TF instead of exit#918
amcadmus merged 6 commits intodeepmodeling:develfrom
njzjz:handle_error

Conversation

@njzjz
Copy link
Member

@njzjz njzjz commented Aug 4, 2021

This commit does three little things:
(1) create an exception called deepmd::deepmd_exception (based on std::runtime_error);
(2) throw this exception instead of exit or std::runtime_error;
(3) catch this exception in the op, and pass to TF using OP_REQUIRES_OK.
One more, the OOM error will raise ResourceExhausted, as the same as TF ops.

The benifit of doing so is that the TF side and Python side can processing other things, catch the error, and print the traceback.
This commit can also fix #802, where the Python didn't save the buffer to the file before exit.

The idea was inspired by TF: https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/core/kernels/mkl/mkl_tfconv_op.h#L120-L126

This commit does three little things:
(1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`);
(2) throw this exception instead of `exit` or `std::runtime_error`;
(3) catch this exception in the op, and pass to TF using `OP_REQUIRES_OK`.
One more, the OOM error will raise ResourceExhausted, as the same as TF ops.

The benifit of doing so is that the TF side and Python side can processing other things, catch the error, and print the traceback.
This commit can also fix deepmodeling#802, where the Python didn't save the buffer to the file before exit.
@codecov-commenter
Copy link

codecov-commenter commented Aug 4, 2021

Codecov Report

Merging #918 (01cad22) into devel (ee0ed99) will increase coverage by 7.87%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##            devel     #918      +/-   ##
==========================================
+ Coverage   75.42%   83.29%   +7.87%     
==========================================
  Files          85      118      +33     
  Lines        6730     9916    +3186     
==========================================
+ Hits         5076     8260    +3184     
- Misses       1654     1656       +2     
Impacted Files Coverage Δ
source/lib/tests/test_pair_tab.cc 100.00% <0.00%> (ø)
source/lib/tests/test_prod_virial_grad_r.cc 100.00% <0.00%> (ø)
source/lib/tests/test_tabulate.cc 100.00% <0.00%> (ø)
source/lib/tests/test_env_mat_r.cc 100.00% <0.00%> (ø)
source/api_cc/tests/test_utils.h 97.22% <0.00%> (ø)
source/api_cc/tests/test_deeppot_model_devi.cc 100.00% <0.00%> (ø)
source/lib/tests/test_soft_min_switch_force.cc 100.00% <0.00%> (ø)
source/lib/tests/test_env_mat_a.cc 100.00% <0.00%> (ø)
source/api_cc/tests/test_ewald.cc 100.00% <0.00%> (ø)
source/lib/tests/test_soft_min_switch_virial.cc 100.00% <0.00%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee0ed99...01cad22. Read the comment docs.

Copy link
Member

@amcadmus amcadmus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid copying the try ... catch ... in all OP implementations.

Shall we define some function like

#define void ComputeFunction(OpKernelContext* context);

void save_compute(OpKernelContext* context, const ComputeFunction &ff) {
  try{
    ff(context);
  } catch (...) {
    ....
  } catch (...) {
    ....
  }
}

where ComputeFunction is the implementation of the OP.

@njzjz
Copy link
Member Author

njzjz commented Aug 4, 2021

See

namespace deepmd {
void save_compute(OpKernelContext* context, std::function<void(OpKernelContext*)> ff) {
try{
ff(context);
} catch (deepmd::deepmd_exception_oom& e){
OP_REQUIRES_OK(
context,
errors::ResourceExhausted("Operation received an exception: ", e.what(),
", in file ",__FILE__, ":", __LINE__));
} catch (deepmd::deepmd_exception& e) {
OP_REQUIRES_OK(
context,
errors::Internal("Operation received an exception: ", e.what(),
", in file ",__FILE__, ":", __LINE__));
}
}
};
and
void Compute(OpKernelContext* context) override {
deepmd::save_compute(context, [this](OpKernelContext* context) {this->_Compute(context);});
}
void _Compute(OpKernelContext* context) {

@amcadmus
Copy link
Member

amcadmus commented Aug 4, 2021

void save_compute(OpKernelContext* context, std::function<void(OpKernelContext*)> ff) {

The name should have been safe_compute, sry my mistake

Copy link
Contributor

@galeselee galeselee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no problem on ROCm platform about UT and compiling

@amcadmus amcadmus merged commit 4be5eb1 into deepmodeling:devel Aug 6, 2021
@njzjz njzjz deleted the handle_error branch August 6, 2021 17:13
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
* Passing error to TF instead of exit

This commit does three little things:
(1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`);
(2) throw this exception instead of `exit` or `std::runtime_error`;
(3) catch this exception in the op, and pass to TF using `OP_REQUIRES_OK`.
One more, the OOM error will raise ResourceExhausted, as the same as TF ops.

The benifit of doing so is that the TF side and Python side can processing other things, catch the error, and print the traceback.
This commit can also fix deepmodeling#802, where the Python didn't save the buffer to the file before exit.

* define try catch function

* replace std::runtime_error

* add headers

* clean useless line

* add custom_op.cc to api_cc tests and rename save_compute to safe_compute
njzjz pushed a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023
…successfully (deepmodeling#918)

Signed-off-by: Chenqqian Zhang
<100290172+Chengqian-Zhang@users.noreply.github.com>
Now there is a problem, if someone wants to use zbl potential, the
energy table will not be copied into the task folder with input.json, I
added a new parameter "srtab_file_path" , the user needs to write the
path of the energy table, when generating the training task, this file
will be copied into the task folder.

Signed-off-by: Chenqqian Zhang <100290172+Chengqian-Zhang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] dp as ASE calculator, no output until the whole script is finished

5 participants