Conversation


@mathpluscode mathpluscode commented Mar 26, 2021

Description

After a long debugging session, I found that the final source of inf was a negative discrete variance (magnitude ~1e-5) in the LNCC loss function. This can happen when, within a window, all voxels have the same value: the variance should then be 0, but numerical error makes it slightly negative.

The fix is to normalise the kernel weights directly, so that we do not need to divide by the kernel volume later. During debugging, the minimum variance is now around -1e-11: although still negative, it is on the order of machine error.

To be safe, we now clip the variance at 0 and add an EPS of 1e-5 (EPS was 1e-7, but VoxelMorph adopts 1e-5, and if we consider mixed precision later, 1e-5 will be safer).
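The shape of the fix, sketched in 1D NumPy rather than the actual TensorFlow implementation (function and variable names here are illustrative, not DeepReg's):

```python
import numpy as np

EPS = 1e-5  # was 1e-7; VoxelMorph uses 1e-5, which is also safer for mixed precision

def windowed_lncc_1d(y_true, y_pred, kernel_size=9):
    """1D sketch of windowed LNCC with the variance fix applied."""
    # Normalise the kernel weights up front (they sum to 1),
    # so no later division by the kernel volume is needed.
    kernel = np.ones(kernel_size) / kernel_size

    def conv(x):
        return np.convolve(x, kernel, mode="valid")

    mu_t, mu_p = conv(y_true), conv(y_pred)
    # E[X^2] - E[X]^2 can come out slightly negative in floating point
    # when all voxels in a window share the same value.
    var_t = conv(y_true ** 2) - mu_t ** 2
    var_p = conv(y_pred ** 2) - mu_p ** 2
    cov = conv(y_true * y_pred) - mu_t * mu_p

    # The fix: clip the variances at 0, then add EPS before dividing.
    var_t = np.maximum(var_t, 0.0)
    var_p = np.maximum(var_p, 0.0)
    return cov ** 2 / (var_t * var_p + EPS)
```

With a constant window the variances clip to 0 and the EPS keeps the denominator positive, so the loss stays finite instead of producing inf.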

Fixes #690

Type of change

What types of changes does your code introduce to DeepReg?

Please check the boxes that apply after submitting the pull request.

  • Bugfix (non-breaking change which fixes an issue)
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Documentation Update (fix or improvement on the documentation)
  • New feature (non-breaking change which adds functionality)
  • Other (if none of the other choices apply)

Checklist

Please check the boxes that apply after submitting the pull request.

If you're unsure about any of them, don't hesitate to ask. We're here to help! This is
simply a reminder of what we are going to look for before merging your code.

  • I have installed pre-commit using pre-commit install and formatted all changed files. If you are not certain, run pre-commit run --all-files.
  • My commit message style matches our requested structure, e.g. Issue #<issue number>: detailed message.
  • I have updated the change log file regarding my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if appropriate).

@codecov

codecov bot commented Mar 26, 2021

Codecov Report

Merging #719 (6ad3312) into main (c48b78b) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #719   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           38        39    +1     
  Lines         2445      2443    -2     
=========================================
- Hits          2445      2443    -2     
Impacted Files Coverage Δ
deepreg/constant.py 100.00% <100.00%> (ø)
deepreg/dataset/loader/interface.py 100.00% <100.00%> (ø)
deepreg/loss/image.py 100.00% <100.00%> (ø)
deepreg/loss/label.py 100.00% <100.00%> (ø)
deepreg/loss/util.py 100.00% <100.00%> (ø)
deepreg/model/network.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c48b78b...6ad3312. Read the comment docs.


@YipengHu YipengHu left a comment


@mathpluscode can you double-check whether there is anywhere that multiple separable kernels were used and normalised afterwards? That normalisation should be redundant now.


mathpluscode commented Mar 27, 2021

I've been testing the new code overnight and no more inf/crash ^^
So I consider the bug to be fixed.

@mathpluscode mathpluscode mentioned this pull request Mar 27, 2021
@YipengHu

For the record:
Project-MONAI/MONAI#1868

@YipengHu

@mathpluscode have you tested if the 2-pass algorithm produces negative var?


mathpluscode commented Mar 28, 2021

> @mathpluscode have you tested if the 2-pass algorithm produces negative var?

It won't, as the convolution is applied to the square of the tensors and the kernel weights are all positive, so by design the result is >= 0.

I've tested the current implementation for multiple epochs and it seems to be safe.
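For reference, the 2-pass computation in question looks roughly like this (a 1D NumPy sketch with illustrative names, not the actual implementation): the window mean is computed first, then the positive kernel weights are applied to the squared deviations, so every term, and hence the variance, is nonnegative by construction.

```python
import numpy as np

def two_pass_windowed_var(x, kernel_size=9):
    """Windowed variance via the 2-pass scheme: nonnegative by design."""
    w = np.ones(kernel_size) / kernel_size  # positive, normalised weights
    n = len(x) - kernel_size + 1
    var = np.empty(n)
    for i in range(n):
        window = x[i:i + kernel_size]
        mu = np.dot(w, window)                   # pass 1: window mean
        var[i] = np.dot(w, (window - mu) ** 2)   # pass 2: weighted sum of squared deviations
    return var
```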

@YipengHu

> @mathpluscode have you tested if the 2-pass algorithm produces negative var?
>
> It won't, as the convolution is applied to the square of the tensors and the kernel weights are all positive, so by design the result is >= 0.
>
> I've tested the current implementation for multiple epochs and it seems to be safe.

So it won't have the inf issue.

@mathpluscode

> > @mathpluscode have you tested if the 2-pass algorithm produces negative var?
> >
> > It won't, as the convolution is applied to the square of the tensors and the kernel weights are all positive, so by design the result is >= 0.
> > I've tested the current implementation for multiple epochs and it seems to be safe.
>
> So it won't have the inf issue.

It should not.

@YipengHu

@acasamitjana can you have a look at this please?


@acasamitjana acasamitjana left a comment


I've seen some issues with negative variance in TensorFlow. Good catch and nice solution, @mathpluscode!

@YipengHu YipengHu merged commit e47c569 into main Mar 31, 2021
@YipengHu YipengHu deleted the 690-nan-inf-loss branch March 31, 2021 10:43


Development

Successfully merging this pull request may close these issues.

Nan/inf loss encountered

4 participants