Loss becoming nan after some time

I have a total_loss which is the sum of:

  1. A BCELoss
  2. A cross-entropy loss
  3. A custom loss function for the image gradient.

The problem I am facing is that after the first batch, some weights are updated to nan, which results in all outputs being nan. If I remove the gradient loss, everything works fine.
What I found out was that the denominators in the gradient loss (the gradient magnitudes) were becoming 0, which was causing the problem. To fix it, I replaced all denominators that were 0 with 1. But doing this only gives a numerical loss for the first backprop; the next one results in nan.

Code:

import torch
import torch.nn.functional as F

# pred, target, boundary_mask and the cuda flag come from the surrounding training code
sobel_x = torch.tensor([[+1, 0, -1], [+2, 0, -2], [+1, 0, -1]], requires_grad=False, dtype=torch.float)
sobel_y = torch.tensor([[+1, +2, +1], [0, 0, 0], [-1, -2, -1]], requires_grad=False, dtype=torch.float)
if cuda:
    sobel_x, sobel_y = sobel_x.cuda(), sobel_y.cuda()
    boundary_mask = boundary_mask.cuda()
sobel_x = sobel_x.view((1, 1, 3, 3))
sobel_y = sobel_y.view((1, 1, 3, 3))

# gradients in the x and y directions for both the predictions and the target transparencies
G_x_pred = F.conv2d(pred, sobel_x, padding=1)
G_y_pred = F.conv2d(pred, sobel_y, padding=1)
G_x_target = F.conv2d(target, sobel_x, padding=1)
G_y_target = F.conv2d(target, sobel_y, padding=1)

# magnitudes of the gradients
M_pred = torch.sqrt(torch.pow(G_x_pred, 2) + torch.pow(G_y_pred, 2))
M_target = torch.sqrt(torch.pow(G_x_target, 2) + torch.pow(G_y_target, 2))

# taking care of nans: replace zero magnitudes with 1 before dividing
M_pred = (M_pred == 0.).float() + M_pred
M_target = (M_target == 0.).float() + M_target

# Lcos = (1 - v_pred * v_target) * M_pred
Lcos = (1 - torch.abs((G_x_pred / M_pred) * (G_x_target / M_target) + (G_y_pred / M_pred) * (G_y_target / M_target))) * M_pred

# Lmag = max(lambda * M_target - M_pred, 0)
lambd = 1.5
Lmag = lambd * M_target - M_pred
Lmag[Lmag < 0] = 0

gamma_1 = 0.5
gamma_2 = 0.5

# total gradient loss
image_gradient_loss = (gamma_1 * Lcos + gamma_2 * Lmag) * boundary_mask

Can someone please help me tackle this situation?

I just checked: the gradients are becoming nan and the refine loss is 0. Due to the nan gradients, the weights become nan, which then results in nan outputs. Why are the gradients nan when I use the refine loss? When I don't use it, the gradients are fine.

Add an epsilon to the denominator to prevent it from becoming zero; gradient clipping can also help.
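
For the clipping part, a minimal sketch of where the call would go, assuming model, optimizer and total_loss are the objects from your training loop (the max_norm value is just a placeholder to tune):

optimizer.zero_grad()
total_loss.backward()
# rescale the gradients so their overall norm does not exceed max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()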

Add the epsilon inside torch.sqrt.
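
The reason the placement matters: the derivative of sqrt(x) is 1/(2*sqrt(x)), which is infinite at x == 0, so replacing the zero magnitudes after the sqrt (as in the masking lines above) does not stop the backward pass from producing inf/nan. With a small epsilon inside the sqrt, both the value and the gradient stay finite. A minimal sketch against the magnitude lines above, with eps as an arbitrary small constant:

eps = 1e-8  # small constant; the exact value is a tuning choice

# magnitudes of the gradients, with eps inside the sqrt so the gradient
# of the square root stays finite even where both Sobel responses are 0
M_pred = torch.sqrt(torch.pow(G_x_pred, 2) + torch.pow(G_y_pred, 2) + eps)
M_target = torch.sqrt(torch.pow(G_x_target, 2) + torch.pow(G_y_target, 2) + eps)

With this change the zero-replacement trick before the division should no longer be needed.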