Why do all gradients go inf while the loss is still small?

When I was training my neural network, I found that the weights of my model became NaN after some steps. Then, following the solutions in other topics, I checked the loss and the gradients of the previous step using torch.isfinite(param.grad).all(). It turned out that the gradients of all layers in my model had become inf, while the loss was still small (at least much smaller than at the beginning).
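
For reference, this is roughly how I ran that check (a minimal sketch; `model` stands for my network and is assumed to already hold the gradients from the last backward call):

import torch

def check_grads(model):
    # print, for every parameter, whether its gradient is entirely finite;
    # a loop like this produced the per-layer listing further below
    all_finite = True
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        finite = torch.isfinite(param.grad).all()
        print(name, finite)
        all_finite = all_finite and bool(finite)
    return all_finite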

This is how I define my loss function:

import torch
import torch.nn as nn
# iou3D (defined elsewhere) computes the 3D IoU between two boxes
# given as [x, y, z, w, h, l, theta].

class GLoss(nn.Module):
    def __init__(self, len):
        super().__init__()
        self.L1_loss = nn.SmoothL1Loss()
        self.len_pred = len
        self.double()

    def forward(self, predict, label):  # the input of iou3D should be [x, y, z, w, h, l, theta]
        loss = torch.tensor(0.0)
        len_gt = len(label)
        for i in range(self.len_pred):
            # block start
            box = predict[i]
            keep = label[0][0]
            best_score = iou3D(box[:-1].detach(), keep.detach())
            for j in range(1, len_gt):
                score = iou3D(box[:-1].detach(), label[0][j].detach())
                if score > best_score:
                    best_score = score
                    keep = label[0][j]
            # block end
            # The block above only chooses a proper training target;
            # I don't think it affects the loss computation itself.
            loss += self.loss_calculate(box, keep, best_score)
        return loss / 128

    def loss_calculate(self, pred, gt, score):
        loss = torch.tensor(0.0)
        # position term
        position = pred[:3]
        tar_pos = gt[:3]
        L1L = self.L1_loss(position, tar_pos)
        loss += L1L
        # box-size term
        Bbox = pred[3:6]
        Bbox_t = gt[3:6]
        BBL = torch.sum(torch.abs(1 - Bbox / Bbox_t))
        loss += BBL
        # angle term
        angle = pred[6]
        angle_t = gt[6]
        AGL = torch.sqrt(2 * (1 - torch.cos(angle - angle_t)))
        loss += AGL
        # classification term (binary cross-entropy on the confidence score)
        if score > 0.5:
            cls = 1
        else:
            cls = 0
        x = torch.sigmoid(pred[-1])
        CLL = -(cls * torch.log(x) + (1 - cls) * torch.log(1 - x))
        loss += CLL
        return loss
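
For context, a minimal sketch of how the loss is called, continuing from the definition above (the tensor shapes and the iou3D stub are only illustrative guesses reconstructed from the indexing, not my real pipeline):

def iou3D(box_a, box_b):
    # stand-in for the real 3D-IoU function, just so this sketch runs
    return torch.rand(())

criterion = GLoss(128)
predictions = torch.randn(128, 8, requires_grad=True)  # [x, y, z, w, h, l, theta, score_logit]
labels = torch.rand(1, 4, 7) + 0.1                     # ground-truth boxes, sizes kept away from zero
loss = criterion(predictions, labels)
loss.backward()
print(loss, torch.isfinite(predictions.grad).all())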

My model is a little large, so I only paste the check output for the top block of the model:

transformer.blocks.5.ln1.weight tensor(False)
transformer.blocks.5.ln1.bias tensor(False)
transformer.blocks.5.ln2.weight tensor(False)
transformer.blocks.5.ln2.bias tensor(False)
transformer.blocks.5.attn.qkv.weight tensor(False)
transformer.blocks.5.attn.qkv.bias tensor(False)
transformer.blocks.5.attn.proj.weight tensor(False)
transformer.blocks.5.attn.proj.bias tensor(False)
transformer.blocks.5.mlp.0.weight tensor(False)
transformer.blocks.5.mlp.0.bias tensor(False)
transformer.blocks.5.mlp.2.weight tensor(False)
transformer.blocks.5.mlp.2.bias tensor(False)
transformer.norm.weight tensor(False)
transformer.norm.bias tensor(False)
transformer.head.fc.weight tensor(False)
transformer.head.fc.bias tensor(False)

The backward pass starts from this last fc layer (transformer.head.fc), and the gradient is already non-finite in that layer.

And the loss value is:

tensor(4.0644, grad_fn=<DivBackward0>)

From my previous experience, I expected gradient explosion to happen in some middle layers of the model, not right at the start of the backward pass, and I expected the loss to be very large when gradients explode. Neither is the case in my problem. So I wonder: is there anything wrong with my loss function?
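
For reference, here is a minimal sketch of how the first non-finite gradient could be localized with per-parameter hooks (`model` is a placeholder for the network; this is a debugging idea, not something from my current code):

import torch

def add_grad_watchers(model):
    # attach a hook to every parameter; hooks fire while backward runs,
    # so the first message printed points to the first parameter whose
    # gradient comes out non-finite along the backward pass
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                print("non-finite grad at", name)
        return hook
    for name, param in model.named_parameters():
        param.register_hook(make_hook(name))

Calling add_grad_watchers(model) once before training should then show, at the step where things break, which parameter's gradient goes inf first.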