I’ll just share my observations here. I am also seeing this NaN gradient issue in my training code. I tried to capture the inputs to CTC whenever the gradients are NaN, using backward hooks. The code looks like this:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class CTCWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, reduction='none', zero_infinity=True)

    def forward(self, logits, logit_lengths, labels, label_lengths):
        def _backward_hook(grad):
            # Dump the CTC inputs whenever the incoming gradient is not finite.
            if not np.isfinite(grad.sum().item()):
                # save logits, logit_lengths, labels, label_lengths and grad
                torch.save(...)

        log_probs = F.log_softmax(logits, 2)
        log_probs.register_hook(_backward_hook)
        return self.ctc(log_probs, labels, logit_lengths, label_lengths)
The variables are dumped successfully when a NaN gradient is detected. The dumped grad variable (I’ll call it dump_grad) has NaN values, while the others (logits, labels) don’t. However, when I fed those same inputs to CTCWrapper alone, the backward gradients (I’ll call them offline_grad) have no NaN values.
I further compared dump_grad and offline_grad: they differ only where dump_grad is NaN. dump_grad is a [396, 93, 96] tensor (T, N, C layout). It has four NaNs in four different samples, all of them at the last time step and in the blank label, i.e. only dump_grad[-1, :, 0] has NaN.
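For completeness, this is roughly how I re-ran the dumped inputs offline; the file name and the dictionary keys are just placeholders, since the actual torch.save(...) call is elided above:

import torch

# Hypothetical offline check: file name and keys are assumptions.
dump = torch.load('ctc_nan_dump.pt', map_location='cuda:0')

captured = []
logits = dump['logits'].detach().requires_grad_(True)
log_probs = torch.nn.functional.log_softmax(logits, 2)
log_probs.register_hook(captured.append)  # just store the gradient of log_probs

ctc = torch.nn.CTCLoss(blank=0, reduction='none', zero_infinity=True)
loss = ctc(log_probs, dump['labels'], dump['logit_lengths'], dump['label_lengths'])
loss.sum().backward()

offline_grad = captured[0]
nan_mask = torch.isnan(dump['grad'])
print(torch.isnan(offline_grad).any())  # no NaNs in the offline run
# the two gradients match everywhere except at the NaN entries:
print((offline_grad[nan_mask == 0] - dump['grad'][nan_mask == 0]).abs().max())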
I don’t know where to go from here, so I have to stop and use a workaround, like setting the NaNs to zeros.
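Concretely, the workaround I mean is just another backward hook on log_probs that zeroes the NaN entries before they propagate into the network (the hook name is mine):

def _zero_nan_hook(grad):
    # Returning a tensor from register_hook replaces the gradient,
    # so the NaN entries never reach the rest of the network.
    if torch.isnan(grad).any():
        grad = grad.clone()
        grad[torch.isnan(grad)] = 0.0
        return grad

log_probs.register_hook(_zero_nan_hook)  # registered next to _backward_hook in forward()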
I’m using PyTorch 1.1 stable and CUDA 10.0.
The values that contain NaNs are attached. I can upload the dump if anyone is interested.
>> dump_grad[-1,:,0]
tensor([-1.1732e-05, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, nan, 1.8528e-04, 1.8528e-04,
1.8528e-04, nan, nan, nan, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04], device='cuda:0')
>> offline_grad[-1,:,0]
tensor([-1.1732e-05, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04, 1.8528e-04,
1.8528e-04, 1.8528e-04, 1.8528e-04], device='cuda:0')