While training my neural network, I found that the weights of my model became NaN after some steps. Following solutions from other topics, I checked the loss and the gradients of the previous step using `torch.isfinite(param.grad).all()`. I found that the gradients of all layers in my model had become inf, while the loss was still small (at least much smaller than at the beginning).
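For reference, a minimal sketch of how such a per-parameter check can be run. The helper name `report_grad_finiteness` and the small `nn.Sequential` stand-in model are assumptions for illustration only, not my actual model.

```python
import torch
import torch.nn as nn

def report_grad_finiteness(model):
    """Map each parameter name to True if every entry of its gradient is finite."""
    report = {}
    for name, param in model.named_parameters():
        if param.grad is None:  # parameter did not take part in the backward pass
            continue
        report[name] = bool(torch.isfinite(param.grad).all())
    return report

# Hypothetical stand-in model, used only to make the sketch runnable.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
out = model(torch.randn(2, 4)).sum()
out.backward()

report = report_grad_finiteness(model)
for name, ok in report.items():
    print(name, ok)
```

A `False` entry means that at least one element of that parameter's gradient is inf or NaN; `torch.isfinite` does not distinguish between the two.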

This is how I define my loss function:

```
import torch
import torch.nn as nn

class GLoss(nn.Module):
    def __init__(self,len):
        super().__init__()
        self.L1_loss= nn.SmoothL1Loss()
        self.len_pred = len
        self.double()
    def forward(self,predict,label):#input of iou3d should be [x,y,z,w,h,l,theta]
        loss = torch.tensor(0.0)
        len_gt = len(label)
        for i in range(self.len_pred):
            #block start
            box=predict[i]
            keep = label[0][0]
            best_score = iou3D(box[:-1].detach(),keep.detach())
            for j in range(1,len_gt):
                score = iou3D(box[:-1].detach(),label[0][j].detach())
                if score > best_score:
                    best_score=score
                    keep=label[0][j]
            #block end
            #The above block is used to choose a proper training target, I don't think that it will affect loss computation.
            loss+=self.loss_calculate(box,keep,best_score)
        return loss/128
    def loss_calculate(self,pred,gt,score):
        loss=torch.tensor(0.0)
        position = pred[:3]
        tar_pos = gt[:3]
        L1L=self.L1_loss(position,tar_pos)
        loss += L1L
        Bbox= pred[3:6]
        Bbox_t= gt[3:6]
        BBL=torch.sum(torch.abs(1-Bbox/Bbox_t))
        loss += BBL
        angle=pred[6]
        angle_t=gt[6]
        AGL=torch.sqrt(2*(1-torch.cos(angle-angle_t)))
        loss += AGL
        if score>0.5:
            cls=1
        else:
            cls=0
        x=torch.sigmoid(pred[-1])
        CLL=-(cls*torch.log(x)+(1-cls)*torch.log(1-x))
        loss += CLL
        return loss
```

My model is fairly large, so I only paste the output of the check for the top block of my model.

```
transformer.blocks.5.ln1.weight tensor(False)
transformer.blocks.5.ln1.bias tensor(False)
transformer.blocks.5.ln2.weight tensor(False)
transformer.blocks.5.ln2.bias tensor(False)
transformer.blocks.5.attn.qkv.weight tensor(False)
transformer.blocks.5.attn.qkv.bias tensor(False)
transformer.blocks.5.attn.proj.weight tensor(False)
transformer.blocks.5.attn.proj.bias tensor(False)
transformer.blocks.5.mlp.0.weight tensor(False)
transformer.blocks.5.mlp.0.bias tensor(False)
transformer.blocks.5.mlp.2.weight tensor(False)
transformer.blocks.5.mlp.2.bias tensor(False)
transformer.norm.weight tensor(False)
transformer.norm.bias tensor(False)
transformer.head.fc.weight tensor(False)
transformer.head.fc.bias tensor(False)
```

Backpropagation starts from the last fc layer, and the gradient is already inf in this layer.

And the loss value is:

`tensor(4.0644, grad_fn=<DivBackward0>)`

From my previous experience, I would expect exploding gradients to appear in some middle layers of the model, not at the input, and I would expect the loss to be very large when gradients explode. Neither is the case here. Therefore, I wonder whether there is something wrong with my loss function?
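One way to narrow this down is to probe each term of `loss_calculate` in isolation and check whether its gradient is finite. The sketch below is only that: the helper `grad_is_finite` and the probe point (a prediction that matches its target exactly, with a class logit of 0 and `cls = 1`) are assumptions for illustration, not values taken from my training run.

```python
import torch
import torch.nn as nn

def grad_is_finite(term_fn, pred):
    """Evaluate one loss term, backpropagate, and return (loss value, all grads finite?)."""
    pred = pred.clone().requires_grad_(True)
    loss = term_fn(pred)
    loss.backward()
    return loss.item(), bool(torch.isfinite(pred.grad).all())

# Hypothetical probe point: [x, y, z, w, h, l, theta] target, prediction identical to it.
gt = torch.tensor([1.0, 2.0, 3.0, 0.5, 0.6, 0.7, 0.3], dtype=torch.double)
pred = torch.cat([gt, torch.tensor([0.0], dtype=torch.double)])  # box + class logit

smooth_l1 = nn.SmoothL1Loss()
terms = {
    "L1L": lambda p: smooth_l1(p[:3], gt[:3]),
    "BBL": lambda p: torch.sum(torch.abs(1 - p[3:6] / gt[3:6])),
    "AGL": lambda p: torch.sqrt(2 * (1 - torch.cos(p[6] - gt[6]))),
    "CLL": lambda p: -torch.log(torch.sigmoid(p[-1])),  # cls = 1 branch only
}

results = {name: grad_is_finite(fn, pred) for name, fn in terms.items()}
for name, (value, finite) in results.items():
    print(name, value, finite)
```

At this assumed probe point the `AGL` term evaluates to 0 but reports a non-finite gradient, because the derivative of `torch.sqrt` is unbounded at 0. That would be consistent with a small loss alongside a non-finite gradient, though whether it is what actually happens during training is not confirmed by this sketch.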