@tumble-weed I’m not aware of such a NaN catcher, but I suspect the Beta distribution falls back to something trivial like a uniform distribution when it receives NaN as input. I also suspect the NaNs arise in the Dice loss function, because I never got a NaN when training with the cross-entropy loss. I have posted the network code above. I’ll try using hooks or retain_grad. Could you point me to some resources on debugging these kinds of issues?
Edit: I used hooks to track this down. It turns out the NaNs are caused by very large gradient values at some layers. Here is a sample of the hook output (each line shows the max and min of a gradient tensor):
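For reference, here is a minimal sketch of the kind of backward hook that produces a trace like the one below. The module names and the exact print format are illustrative, not my actual code:

```python
import torch
import torch.nn as nn

def make_grad_hook(name):
    """Print the max/min of each gradient tensor flowing through a module."""
    def hook(module, grad_input, grad_output):
        for i, g in enumerate(grad_input):
            if g is not None:  # inputs that don't require grad yield None
                print('inp %s, %d, %s, %s' % (name, i, g.max().item(), g.min().item()))
        for i, g in enumerate(grad_output):
            if g is not None:
                print('out %s, %d, %s, %s' % (name, i, g.max().item(), g.min().item()))
    return hook

# Toy two-layer model standing in for the real network.
model = nn.Sequential(
    nn.Conv3d(1, 4, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
    nn.PReLU(),
)
for name, module in model.named_children():
    module.register_full_backward_hook(make_grad_hook(name))

out = model(torch.randn(1, 1, 2, 8, 8))
out.sum().backward()  # hooks fire during this backward pass
```

Registering one hook per named submodule like this lets you see where in the backward pass the gradients first blow up or turn into NaN.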
```
inp pconv4, 0, 1.60642921401e-05, -2.30273017223e-05
inp pconv4, 1, 0.1004197523, -0.0698678120971
inp pconv4, 2, -0.114721596241, -0.114721596241
out pconv4, 0, 1.54108820425e-05, -1.4961476154e-05
inp pconv3, 0, 2.00003178179e-05, -2.27984455705e-05
inp pconv3, 1, -3.40282346639e+38, 3.40282346639e+38
inp pconv3, 2, -3.40282346639e+38, 3.40282346639e+38
out pconv3, 0, 8.05776580819e-05, -0.000112374822493
inp tconv4, 0, 1.4272424778e-05, -2.27984455705e-05
inp tconv4, 1, -3.40282346639e+38, 3.40282346639e+38
out tconv4, 0, 2.00003178179e-05, -2.27984455705e-05
inp res10, 0, 1.41433047247e-05, -8.82773838384e-06
inp res10, 1, 1.41433047247e-05, -8.82773838384e-06
out res10, 0, 1.41433047247e-05, -8.82773838384e-06
inp res9, 0, 1.33553148771e-05, -9.24733012653e-06
inp res9, 1, 1.33553148771e-05, -9.24733012653e-06
out res9, 0, 1.33553148771e-05, -9.24733012653e-06
inp res8, 0, 1.54191238835e-05, -1.03389274955e-05
inp res8, 1, 1.54191238835e-05, -1.03389274955e-05
out res8, 0, 1.54191238835e-05, -1.03389274955e-05
inp pconv2, 0, 9.37501172302e-05, -0.000109870823508
inp pconv2, 1, -3.40282346639e+38, 3.40282346639e+38
inp pconv2, 2, -3.40282346639e+38, 3.40282346639e+38
out pconv2, 0, 0.000202267314307, -0.000283791217953
inp tconv3, 0, 3.52506276613e-06, -9.23275365494e-06
inp tconv3, 1, -3.40282346639e+38, 3.40282346639e+38
out tconv3, 0, 4.74067655887e-06, -9.23275365494e-06
inp res7, 0, 2.18382069761e-07, -1.71947704075e-07
inp res7, 1, 2.18382069761e-07, -1.71947704075e-07
out res7, 0, 2.18382069761e-07, -1.71947704075e-07
inp res6, 0, nan, nan
inp res6, 1, nan, nan
out res6, 0, nan, nan
inp res5, 0, nan, nan
inp res5, 1, nan, nan
out res5, 0, nan, nan
inp AvgPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0), 9.59493546443e+14, -7.67205847335e+14
out AvgPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0), nan, nan
inp pconv1, 0, 2.66912047664e-05, -2.00852118724e-05
inp pconv1, 1, -3.40282346639e+38, 3.40282346639e+38
inp pconv1, 2, -3.40282346639e+38, 3.40282346639e+38
out pconv1, 0, 4.03012127208e-05, -6.12536241533e-05
inp tconv2, 0, 9.59493546443e+14, -5.2268440014e+14
inp tconv2, 1, -3.40282346639e+38, 3.40282346639e+38
out tconv2, 0, 9.59493546443e+14, -7.67205847335e+14
inp res4, 0, 3.21442163982e+14, -2.90692882498e+14
inp res4, 1, 3.21442163982e+14, -2.90692882498e+14
out res4, 0, 3.21442163982e+14, -2.90692882498e+14
inp res3, 0, 4.12539561181e+14, -3.18285765673e+14
inp res3, 1, 4.12539561181e+14, -3.18285765673e+14
out res3, 0, 4.12539561181e+14, -3.18285765673e+14
inp AvgPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0), nan, nan
out AvgPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0), 5.69794319352e+14, -5.25037538902e+14
inp tconv1, 0, nan, nan
inp tconv1, 1, -3.40282346639e+38, 3.40282346639e+38
out tconv1, 0, nan, nan
inp res2, 0, nan, nan
inp res2, 1, nan, nan
out res2, 0, nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), -3.40282346639e+38, 3.40282346639e+38
out conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), -3.40282346639e+38, 3.40282346639e+38
out conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp res1, 0, nan, nan
inp res1, 1, nan, nan
out res1, 0, nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), -3.40282346639e+38, 3.40282346639e+38
out conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), -3.40282346639e+38, 3.40282346639e+38
out conv(
  (layer): Sequential(
    (0): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1))
    (1): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=1)
  )
), nan, nan
inp conv1, 0, nan, nan
inp conv1, 1, -3.40282346639e+38, 3.40282346639e+38
out conv1, 0, nan, nan
```
There are a few places where the gradients’ max and min come out extremely large, and the NaNs follow from there. Notably, ±3.40282346639e+38 is exactly the float32 maximum, so those gradients have already saturated. I’m already using residual connections, BatchNorm, and PReLU to prevent gradient explosion.
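Since those layers don’t bound the gradient magnitudes themselves, one standard fallback is to clip the gradient norm before each optimizer step; anomaly detection can additionally pinpoint the first backward op that produces a non-finite value. A minimal sketch of that training-step pattern, with a hypothetical tiny model standing in for the real network:

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, just to show the training-step pattern.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Optional: makes autograd raise as soon as a backward op produces NaN/Inf,
# pointing at the offending operation instead of a later layer.
torch.autograd.set_detect_anomaly(True)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0 before stepping.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping doesn’t fix the underlying cause, but it keeps the e+14-scale gradients above from saturating to the float32 max and cascading into NaNs while you debug.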