690 nan inf loss #719
Conversation
Codecov Report
@@            Coverage Diff            @@
##              main      #719   +/-   ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files           38        39    +1
  Lines         2445      2443    -2
==========================================
- Hits          2445      2443    -2
YipengHu left a comment
@mathpluscode can you double-check whether there is anywhere that multiple separable kernels are used and then normalised afterwards? That normalisation should be redundant now.
I've been testing the new code overnight and there is no more inf/crash ^^
For the record:
@mathpluscode have you tested whether the 2-pass algorithm produces negative variance?
It won't, as the convolution is applied to the square of the tensors and the convolution weights are all positive, so by design the result will be >= 0. I've tested the current implementation for multiple epochs and it seems to be safe.
So it won't have the inf issue.
It should not.
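For reference, a minimal NumPy sketch (a toy illustration, not the DeepReg code) of why the one-pass form E[x^2] - E[x]^2 can dip slightly below zero while the two-pass form stays non-negative by construction:

```python
import numpy as np

# Toy 1-D window where every voxel has the same value, so the true variance is 0.
x = np.full(9, 0.57, dtype=np.float32)
w = np.full(9, 1.0 / 9, dtype=np.float32)  # normalized averaging weights (sum to 1)

mean = np.dot(w, x)

# One-pass form E[x^2] - E[x]^2: the cancellation of two nearly equal numbers
# can leave a tiny negative residue, which later turns into NaN/inf.
one_pass_var = np.dot(w, x * x) - mean * mean

# Two-pass form: the weights are positive and the squared deviations are >= 0,
# so the weighted sum cannot be negative.
two_pass_var = np.dot(w, (x - mean) ** 2)

print(one_pass_var)  # may be a tiny negative number, on the order of machine error
print(two_pass_var)  # always >= 0
```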
@acasamitjana can you have a look at this please?
acasamitjana left a comment
I've seen some issues with negative variance in TensorFlow. Good catch and nice solution @mathpluscode
Description
After a long debugging session, I found that the final source of inf was a negative discrete variance (~-1e-5) in the LNCC loss function. This can happen because, within a window, the voxels might all have the same value, so the variance should be 0, but numerical issues somehow produce a negative value instead.
The fix is to normalize the kernel weights directly so that we do not need to divide by the kernel volume later. During debugging, the minimum variance is now ~-1e-11; although still negative, it is around machine error now.
To be safe, we now clip the variance at 0 and add an EPS of 1e-5 (EPS was 1e-7, but VoxelMorph adopts 1e-5, and if we consider mixed precision later, 1e-5 will be safer).
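For illustration, a minimal TensorFlow sketch of the idea (the function name, kernel construction, and shapes here are hypothetical, not the DeepReg implementation; it assumes a 3-D image with batch and trailing channel dimensions):

```python
import tensorflow as tf

EPS = 1e-5  # safety epsilon added to the variance terms, as described above


def windowed_variance(image: tf.Tensor, kernel_size: int = 9) -> tf.Tensor:
    """Local variance via E[x^2] - E[x]^2 with a pre-normalized kernel.

    image: float tensor of shape (batch, dim1, dim2, dim3, 1).
    Because the kernel weights already sum to 1, no later division by the
    kernel volume is needed.
    """
    kernel = tf.ones(
        [kernel_size, kernel_size, kernel_size, 1, 1], dtype=image.dtype
    )
    kernel = kernel / tf.reduce_sum(kernel)  # normalize the weights directly

    def conv(x: tf.Tensor) -> tf.Tensor:
        return tf.nn.conv3d(x, kernel, strides=[1, 1, 1, 1, 1], padding="SAME")

    mean = conv(image)                # E[x] over each window
    mean_of_sq = conv(image * image)  # E[x^2] over each window

    # E[x^2] - E[x]^2 can still be slightly negative (~-1e-11) in constant
    # windows, so clip at 0 before it reaches any division.
    return tf.maximum(mean_of_sq - mean * mean, 0.0)
```

A denominator built from such variances plus EPS (for example, a product of the fixed and moving image variances) then stays strictly positive, which is what avoids the inf/NaN.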
Fixes #690
Type of change
What types of changes does your code introduce to DeepReg?
Please check the boxes that apply after submitting the pull request.
Checklist
Please check the boxes that apply after submitting the pull request.
If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.
- I have installed pre-commit using pre-commit install and formatted all changed files. If you are not certain, run pre-commit run --all-files.
- My commit messages follow our requested structure, e.g. Issue #<issue number>: detailed message.
- I have updated the change log file regarding my changes.