Debugging nan gradients: what am I doing wrong?

I see nan gradients in my model parameters.

I set torch.autograd.set_detect_anomaly(True) and it reports Function 'DivBackward0' returned nan values in its 1th output, pointing at this line:

div = x / scale
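
For reference, here is a hypothetical standalone snippet (made-up tensors, nothing from my actual model) showing how this error can appear even when x, scale and div are all finite: an inf gradient from a later op turns into a nan for scale inside the division's backward.

import torch

torch.autograd.set_detect_anomaly(True)

# Made-up tensors: x, scale and div are all finite, so isnan()/isinf()
# checks on the forward values would print False.
x = torch.tensor([0.0, 1.0], requires_grad=True)
scale = torch.tensor([1.0, 1.0], requires_grad=True)

div = x / scale          # forward: [0., 1.]
out = div.sqrt().sum()   # forward: 1.0, but d(sqrt)/d(div) = inf where div == 0

out.backward()
# RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.
# (the grad wrt scale is -grad_div * x / scale**2 = -inf * 0 / 1 = nan)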

So I tried to print the nan gradient by doing

x.requires_grad_()
# forward-value check: does x itself contain nan/inf?
print(f"x: {x.isnan().any().item(), x.isinf().any().item()}", flush=True)
# backward hook: fires when the gradient w.r.t. x is computed
x.register_hook(lambda grad: print(f"x: {grad.isnan().any().item(), grad.isinf().any().item()}", flush=True))

scale.requires_grad_()
print(f"scale: {scale.isnan().any().item(), scale.isinf().any().item()}", flush=True)
scale.register_hook(lambda grad: print(f"scale: {grad.isnan().any().item(), grad.isinf().any().item()}", flush=True))

div = x / scale

div.requires_grad_()
print(f"div: {div.isnan().any().item(), div.isinf().any().item()}", flush=True)
div.register_hook(lambda grad: print(f"div: {grad.isnan().any().item(), grad.isinf().any().item()}", flush=True))

But all I got was False. What am I doing wrong?

Could you share the full script where this is happening?

I have not found the exact cause of the nans yet, but I have figured out the answer to the debugging question.

To see the nans printed, I should have registered the hooks (e.g. scale.register_hook(print)) without setting torch.autograd.set_detect_anomaly(True). Otherwise, anomaly detection raises its error and stops the program before the backward hooks are ever called, so the only output I saw was the forward checks, which were all False because the forward values really are finite.
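
Here is a minimal sketch of the working approach, reusing the made-up tensors from the snippet in my question: with anomaly detection left off, backward() runs far enough for the hooks to fire, and they show which gradient actually contains the nan.

import torch

# Same made-up tensors as above; torch.autograd.set_detect_anomaly(True)
# is deliberately NOT set, so backward() keeps going and the hooks fire.
x = torch.tensor([0.0, 1.0], requires_grad=True)
scale = torch.tensor([1.0, 1.0], requires_grad=True)

div = x / scale
div.register_hook(lambda grad: print("div grad has nan:", grad.isnan().any().item()))
scale.register_hook(lambda grad: print("scale grad has nan:", grad.isnan().any().item()))

out = div.sqrt().sum()
out.backward()
# div grad has nan: False
# scale grad has nan: True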