Conversation


@mathpluscode mathpluscode commented Mar 26, 2021

Description

After a long debugging session, I found that the final source of inf was a negative discrete variance (magnitude ~1e-5) in the LNCC loss function. This can happen when, within a window, all voxels have the same value: the variance should then be 0, but numerical error makes it slightly negative.

The fix is to normalise the kernel weights directly, so that we do not need to divide by the kernel volume later. During debugging, the minimum variance is now around -1e-11: although still negative, it is on the order of machine error.

To be safe, we now clip the variance at 0 and add an EPS of 1e-5 (EPS was 1e-7, but VoxelMorph adopts 1e-5, and if we consider mixed precision later, 1e-5 will be safer).
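The shape of the fix, sketched in 1D NumPy rather than the actual TensorFlow implementation (function and variable names here are illustrative, not DeepReg's):

```python
import numpy as np

EPS = 1e-5  # was 1e-7; VoxelMorph uses 1e-5, which is also safer for mixed precision

def windowed_lncc_1d(y_true, y_pred, kernel_size=9):
    """1D sketch of windowed LNCC with the variance fix applied."""
    # Normalise the kernel weights up front (they sum to 1),
    # so no later division by the kernel volume is needed.
    kernel = np.ones(kernel_size) / kernel_size

    def conv(x):
        return np.convolve(x, kernel, mode="valid")

    mu_t, mu_p = conv(y_true), conv(y_pred)
    # E[X^2] - E[X]^2 can come out slightly negative in floating point
    # when all voxels in a window share the same value.
    var_t = conv(y_true ** 2) - mu_t ** 2
    var_p = conv(y_pred ** 2) - mu_p ** 2
    cov = conv(y_true * y_pred) - mu_t * mu_p

    # The fix: clip the variances at 0, then add EPS before dividing.
    var_t = np.maximum(var_t, 0.0)
    var_p = np.maximum(var_p, 0.0)
    return cov ** 2 / (var_t * var_p + EPS)
```

With a constant window the variances clip to 0 and the EPS keeps the denominator positive, so the loss stays finite instead of producing inf.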

Fixes #690

Type of change

What types of changes does your code introduce to DeepReg?

Please check the boxes that apply after submitting the pull request.

  • Bugfix (non-breaking change which fixes an issue)
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Documentation Update (fix or improvement on the documentation)
  • New feature (non-breaking change which adds functionality)
  • Other (if none of the other choices apply)

Checklist

Please check the boxes that apply after submitting the pull request.

If you're unsure about any of them, don't hesitate to ask. We're here to help! This is
simply a reminder of what we are going to look for before merging your code.

  • I have installed pre-commit using pre-commit install and formatted all changed files. If you are not certain, run pre-commit run --all-files.
  • My commit message style matches our requested structure, e.g. Issue #<issue number>: detailed message.
  • I have updated the change log file regarding my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if appropriate).

@codecov

codecov bot commented Mar 26, 2021

Codecov Report

Merging #719 (6ad3312) into main (c48b78b) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #719   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           38        39    +1     
  Lines         2445      2443    -2     
=========================================
- Hits          2445      2443    -2     
Impacted Files Coverage Δ
deepreg/constant.py 100.00% <100.00%> (ø)
deepreg/dataset/loader/interface.py 100.00% <100.00%> (ø)
deepreg/loss/image.py 100.00% <100.00%> (ø)
deepreg/loss/label.py 100.00% <100.00%> (ø)
deepreg/loss/util.py 100.00% <100.00%> (ø)
deepreg/model/network.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c48b78b...6ad3312. Read the comment docs.


@YipengHu YipengHu left a comment


@mathpluscode can you double-check whether there is anywhere that multiple separable kernels were used and normalised afterwards? That normalisation should be redundant now.


mathpluscode commented Mar 27, 2021

I've been testing the new code overnight and no more inf/crash ^^
So I consider the bug to be fixed.

@mathpluscode mathpluscode mentioned this pull request Mar 27, 2021
@YipengHu

For the record:
Project-MONAI/MONAI#1868

@YipengHu

@mathpluscode have you tested if the 2-pass algorithm produces negative var?


mathpluscode commented Mar 28, 2021

> @mathpluscode have you tested if the 2-pass algorithm produces negative var?

It won't, as the convolution is applied to the square of the tensors and the kernel weights are all positive, so by design the result is >= 0.

I've tested the current implementation for multiple epochs and it seems to be safe.
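For reference, the 2-pass computation in question looks roughly like this (a 1D NumPy sketch with illustrative names, not the actual implementation): the window mean is computed first, then the positive kernel weights are applied to the squared deviations, so every term, and hence the variance, is nonnegative by construction.

```python
import numpy as np

def two_pass_windowed_var(x, kernel_size=9):
    """Windowed variance via the 2-pass scheme: nonnegative by design."""
    w = np.ones(kernel_size) / kernel_size  # positive, normalised weights
    n = len(x) - kernel_size + 1
    var = np.empty(n)
    for i in range(n):
        window = x[i:i + kernel_size]
        mu = np.dot(w, window)                   # pass 1: window mean
        var[i] = np.dot(w, (window - mu) ** 2)   # pass 2: weighted sum of squared deviations
    return var
```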

@YipengHu

> @mathpluscode have you tested if the 2-pass algorithm produces negative var?
>
> It won't, as the convolution is applied to the square of the tensors and the kernel weights are all positive, so by design the result is >= 0.
>
> I've tested the current implementation for multiple epochs and it seems to be safe.

So it won't have the inf issue.

@mathpluscode

> > @mathpluscode have you tested if the 2-pass algorithm produces negative var?
> >
> > It won't, as the convolution is applied to the square of the tensors and the kernel weights are all positive, so by design the result is >= 0.
> > I've tested the current implementation for multiple epochs and it seems to be safe.
>
> So it won't have the inf issue.

It should not.

@YipengHu

@acasamitjana can you have a look at this please?


@acasamitjana acasamitjana left a comment


I've seen some issues with negative variance in TensorFlow. Good catch and nice solution, @mathpluscode!

@YipengHu YipengHu merged commit e47c569 into main Mar 31, 2021
@YipengHu YipengHu deleted the 690-nan-inf-loss branch March 31, 2021 10:43


Development

Successfully merging this pull request may close these issues.

Nan/inf loss encountered

4 participants