Passing error to TF instead of exit by njzjz · Pull Request #918 · deepmodeling/deepmd-kit

njzjz · 2021-08-04T04:06:52Z

This commit does three little things:
(1) create an exception called deepmd::deepmd_exception (based on std::runtime_error);
(2) throw this exception instead of exit or std::runtime_error;
(3) catch this exception in the op, and pass to TF using OP_REQUIRES_OK.
One more, the OOM error will raise ResourceExhausted, as the same as TF ops.

The benifit of doing so is that the TF side and Python side can processing other things, catch the error, and print the traceback.
This commit can also fix #802, where the Python didn't save the buffer to the file before exit.

The idea was inspired by TF: https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/core/kernels/mkl/mkl_tfconv_op.h#L120-L126

This commit does three little things: (1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`); (2) throw this exception instead of `exit` or `std::runtime_error`; (3) catch this exception in the op, and pass to TF using `OP_REQUIRES_OK`. One more, the OOM error will raise ResourceExhausted, as the same as TF ops. The benifit of doing so is that the TF side and Python side can processing other things, catch the error, and print the traceback. This commit can also fix deepmodeling#802, where the Python didn't save the buffer to the file before exit.

codecov-commenter · 2021-08-04T04:10:26Z

Codecov Report

Merging #918 (01cad22) into devel (ee0ed99) will increase coverage by 7.87%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##            devel     #918      +/-   ##
==========================================
+ Coverage   75.42%   83.29%   +7.87%     
==========================================
  Files          85      118      +33     
  Lines        6730     9916    +3186     
==========================================
+ Hits         5076     8260    +3184     
- Misses       1654     1656       +2

Impacted Files	Coverage Δ
source/lib/tests/test_pair_tab.cc	`100.00% <0.00%> (ø)`
source/lib/tests/test_prod_virial_grad_r.cc	`100.00% <0.00%> (ø)`
source/lib/tests/test_tabulate.cc	`100.00% <0.00%> (ø)`
source/lib/tests/test_env_mat_r.cc	`100.00% <0.00%> (ø)`
source/api_cc/tests/test_utils.h	`97.22% <0.00%> (ø)`
source/api_cc/tests/test_deeppot_model_devi.cc	`100.00% <0.00%> (ø)`
source/lib/tests/test_soft_min_switch_force.cc	`100.00% <0.00%> (ø)`
source/lib/tests/test_env_mat_a.cc	`100.00% <0.00%> (ø)`
source/api_cc/tests/test_ewald.cc	`100.00% <0.00%> (ø)`
source/lib/tests/test_soft_min_switch_virial.cc	`100.00% <0.00%> (ø)`
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee0ed99...01cad22. Read the comment docs.

amcadmus

We should avoid copying the try ... catch ... in all OP implementations.

Shall we define some function like

#define void ComputeFunction(OpKernelContext* context);

void save_compute(OpKernelContext* context, const ComputeFunction &ff) {
  try{
    ff(context);
  } catch (...) {
    ....
  } catch (...) {
    ....
  }
}

where ComputeFunction is the implementation of the OP.

njzjz · 2021-08-04T09:27:43Z

See

deepmd-kit/source/op/custom_op.cc

Lines 4 to 20 in 9c8a0da

    
           namespace deepmd { 
        
             void save_compute(OpKernelContext* context, std::function<void(OpKernelContext*)> ff) { 
        
               try{ 
        
                 ff(context); 
        
               } catch (deepmd::deepmd_exception_oom& e){ 
        
                 OP_REQUIRES_OK( 
        
                     context, 
        
                     errors::ResourceExhausted("Operation received an exception: ", e.what(), 
        
                                     ", in file ",__FILE__, ":", __LINE__)); 
        
               } catch (deepmd::deepmd_exception& e) { 
        
                 OP_REQUIRES_OK( 
        
                     context, 
        
                     errors::Internal("Operation received an exception: ", e.what(), 
        
                                     ", in file ",__FILE__, ":", __LINE__)); 
        
               } 
        
             } 
        
           };

and

deepmd-kit/source/op/descrpt.cc

Lines 52 to 56 in 9c8a0da

    
           void Compute(OpKernelContext* context) override { 
        
             deepmd::save_compute(context, [this](OpKernelContext* context) {this->_Compute(context);}); 
        
           } 
        
           void _Compute(OpKernelContext* context) {

amcadmus · 2021-08-04T10:58:28Z

deepmd-kit/source/op/custom_op.cc

Line 5 in 9c8a0da

void save_compute(OpKernelContext* context, std::function<void(OpKernelContext*)> ff) {

The name should have been safe_compute, sry my mistake

galeselee

There is no problem on ROCm platform about UT and compiling

* Passing error to TF instead of exit This commit does three little things: (1) create an exception called `deepmd::deepmd_exception` (based on `std::runtime_error`); (2) throw this exception instead of `exit` or `std::runtime_error`; (3) catch this exception in the op, and pass to TF using `OP_REQUIRES_OK`. One more, the OOM error will raise ResourceExhausted, as the same as TF ops. The benifit of doing so is that the TF side and Python side can processing other things, catch the error, and print the traceback. This commit can also fix deepmodeling#802, where the Python didn't save the buffer to the file before exit. * define try catch function * replace std::runtime_error * add headers * clean useless line * add custom_op.cc to api_cc tests and rename save_compute to safe_compute

…successfully (deepmodeling#918) Signed-off-by: Chenqqian Zhang <100290172+Chengqian-Zhang@users.noreply.github.com> Now there is a problem, if someone wants to use zbl potential, the energy table will not be copied into the task folder with input.json, I added a new parameter "srtab_file_path" , the user needs to write the path of the energy table, when generating the training task, this file will be copied into the task folder. Signed-off-by: Chenqqian Zhang <100290172+Chengqian-Zhang@users.noreply.github.com>

njzjz requested review from amcadmus, denghuilu, galeselee and iProzd August 4, 2021 04:06

njzjz linked an issue Aug 4, 2021 that may be closed by this pull request

[BUG] dp as ASE calculator, no output until the whole script is finished #802

Closed

amcadmus requested changes Aug 4, 2021

View reviewed changes

njzjz added 4 commits August 4, 2021 05:21

define try catch function

46ee28b

replace std::runtime_error

3b57006

add headers

e70d780

clean useless line

9c8a0da

add custom_op.cc to api_cc tests and rename save_compute to safe_compute

01cad22

njzjz force-pushed the handle_error branch from 300f5a8 to 01cad22 Compare August 4, 2021 19:48

amcadmus approved these changes Aug 4, 2021

View reviewed changes

denghuilu approved these changes Aug 6, 2021

View reviewed changes

galeselee approved these changes Aug 6, 2021

View reviewed changes

amcadmus merged commit 4be5eb1 into deepmodeling:devel Aug 6, 2021

njzjz deleted the handle_error branch August 6, 2021 17:13

njzjz mentioned this pull request Aug 22, 2021

give a clear message if model.get_ntypes()<data.get_ntypes() #1016

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Passing error to TF instead of exit#918

Passing error to TF instead of exit#918
amcadmus merged 6 commits intodeepmodeling:develfrom
njzjz:handle_error

njzjz commented Aug 4, 2021 •

edited

Loading

Uh oh!

codecov-commenter commented Aug 4, 2021 •

edited

Loading

Uh oh!

amcadmus left a comment

Uh oh!

njzjz commented Aug 4, 2021

Uh oh!

amcadmus commented Aug 4, 2021

Uh oh!

galeselee left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

njzjz commented Aug 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Aug 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

amcadmus left a comment

Choose a reason for hiding this comment

Uh oh!

njzjz commented Aug 4, 2021

Uh oh!

amcadmus commented Aug 4, 2021

Uh oh!

galeselee left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

njzjz commented Aug 4, 2021 •

edited

Loading

codecov-commenter commented Aug 4, 2021 •

edited

Loading