Passing error to TF instead of exit #918
This commit does three small things: (1) it creates an exception called `deepmd::deepmd_exception` (derived from `std::runtime_error`); (2) it throws this exception instead of calling `exit` or throwing `std::runtime_error`; (3) it catches this exception in the op and passes it to TF using `OP_REQUIRES_OK`. In addition, an out-of-memory (OOM) error will raise `ResourceExhausted`, the same as in TF ops. The benefit is that the TF and Python sides can continue processing other things, catch the error, and print the traceback. This commit can also fix deepmodeling#802, where Python did not flush the buffer to the file before exiting.
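A minimal, framework-free sketch of items (1) and (2): an exception named `deepmd::deepmd_exception` derived from `std::runtime_error`, thrown at a point where code previously called `exit`. The `"DeePMD-kit Error: "` message prefix and the helpers `check_natoms` and `try_check` are hypothetical and only for illustration; the class actually added in deepmd-kit may differ.

```cpp
#include <stdexcept>
#include <string>

namespace deepmd {
// Sketch of the exception described above. Deriving from std::runtime_error
// means existing `catch (const std::runtime_error&)` handlers still match it.
struct deepmd_exception : public std::runtime_error {
  explicit deepmd_exception(const std::string& msg)
      : std::runtime_error("DeePMD-kit Error: " + msg) {}
};
}  // namespace deepmd

// Hypothetical check that, before this change, would have printed a
// message and called exit(1), killing the whole process.
void check_natoms(int natoms) {
  if (natoms <= 0) {
    throw deepmd::deepmd_exception("natoms must be positive, got " +
                                   std::to_string(natoms));
  }
}

// Hypothetical caller: it receives the error message and keeps running.
std::string try_check(int natoms) {
  try {
    check_natoms(natoms);
    return "ok";
  } catch (const std::runtime_error& e) {  // also catches deepmd_exception
    return e.what();
  }
}
```

Because the error now propagates as an exception, the caller (and ultimately TF and Python) decides what to do with it instead of the process terminating immediately.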
Codecov Report
@@ Coverage Diff @@
## devel #918 +/- ##
==========================================
+ Coverage 75.42% 83.29% +7.87%
==========================================
Files 85 118 +33
Lines 6730 9916 +3186
==========================================
+ Hits 5076 8260 +3184
- Misses 1654 1656 +2
amcadmus left a comment:
We should avoid copying the `try ... catch ...` block into every OP implementation.
Shall we define a function like
typedef std::function<void(OpKernelContext*)> ComputeFunction;
void save_compute(OpKernelContext* context, const ComputeFunction &ff) {
  try {
    ff(context);
  } catch (...) {
    ....
  } catch (...) {
    ....
  }
}
where ComputeFunction is the implementation of the OP?
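The shared wrapper suggested in this comment can be sketched without TensorFlow as follows, using the name `safe_compute` that the PR eventually adopted in place of `save_compute`. `OpKernelContext` here is a hypothetical stand-in that only records a status code and message; in a real kernel the two `catch` bodies would instead report through `OP_REQUIRES_OK(context, ...)` with a `tensorflow::errors::ResourceExhausted(...)` or `tensorflow::errors::Internal(...)` status. Treating OOM as `std::bad_alloc` is an assumption of this sketch.

```cpp
#include <functional>
#include <new>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for tensorflow::OpKernelContext: it only stores a status.
enum class Code { OK, INTERNAL, RESOURCE_EXHAUSTED };
struct OpKernelContext {
  Code code = Code::OK;
  std::string message;
  void SetStatus(Code c, const std::string& msg) { code = c; message = msg; }
};

namespace deepmd {
struct deepmd_exception : public std::runtime_error {
  explicit deepmd_exception(const std::string& msg) : std::runtime_error(msg) {}
};
}  // namespace deepmd

using ComputeFunction = std::function<void(OpKernelContext*)>;

// One shared try/catch for every op: exceptions are turned into a status on
// the context instead of terminating the process.
void safe_compute(OpKernelContext* context, const ComputeFunction& ff) {
  try {
    ff(context);
  } catch (const std::bad_alloc& e) {
    // Assumption: OOM surfaces as std::bad_alloc and maps to RESOURCE_EXHAUSTED.
    context->SetStatus(Code::RESOURCE_EXHAUSTED, e.what());
  } catch (const deepmd::deepmd_exception& e) {
    context->SetStatus(Code::INTERNAL, e.what());
  }
}
```

Each op's `Compute` would then forward to its real implementation through a lambda, so the `try ... catch ...` logic lives in exactly one place.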
See `deepmd-kit/source/op/custom_op.cc` lines 4 to 20 and `deepmd-kit/source/op/descrpt.cc` lines 52 to 56 at commit 9c8a0da.
The name should have been
galeselee left a comment:
Unit tests and compilation work without problems on the ROCm platform.
* Passing error to TF instead of exit
* define try catch function
* replace std::runtime_error
* add headers
* clean useless line
* add custom_op.cc to api_cc tests and rename save_compute to safe_compute
The idea was inspired by TF: https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/core/kernels/mkl/mkl_tfconv_op.h#L120-L126