Why does passing a list of cross-entropy losses to autograd.backward() cause NaN?

Hui_Lin · January 9, 2019, 4:11am

Problem now: I disabled the cross-entropy loss and trained with MSE only, which fixed the problem described below.
So the cross-entropy loss is most likely the root of that problem.
Can anyone explain why passing the list of cross-entropy losses to autograd.backward() causes NaN?
Thanks!

Problem solved:
I tried lowering the learning rate but still got NaN. The first place NaN appeared was layer2.0.conv1.weights (or sometimes layer4.2.bn3.bias) in ResNet, and only the last element of the weights was NaN. The input and the loss for that iteration were not NaN.

After that iteration, however, all weights and losses became NaN.

The code below is based on natanielruiz/deep-head-pose on GitHub (Deep Learning Head Pose Estimation using PyTorch).
That project adds 3 fully connected layers on top of ResNet; I use 27.

How can I fix this problem? Thanks, everyone.

"""training code"""

    for epoch in range(num_epochs):
        for i, (images, labels, cont_labels, pic_names) in enumerate(train_loader):
            images = Variable(images).cuda(gpu)
            label_sh = Variable(labels).cuda(gpu) # Binned labels
            label_sh_cont = Variable(cont_labels).cuda(gpu) # Continuous labels
            sh = model(images)  # prediction
            loss_seq = []
            for j in range(num_coefs):
                # **Cross entropy loss**
                loss_sh = criterion(sh[:, j], label_sh[:, j]) 
                # MSE loss
                sh_predicted = softmax(sh[:, j])
                sh_predicted = torch.sum(sh_predicted * idx_tensor, 1) * (2. / num_bins) - 1
                loss_reg_sh = reg_criterion(sh_predicted, label_sh_cont[:, j])
                # Total loss
                loss_sh += alpha * loss_reg_sh
                loss_seq.append(loss_sh)

            grad_seq = [torch.ones(1).cuda(gpu) for _ in range(len(loss_seq))] 
            optimizer.zero_grad()
            torch.autograd.backward(loss_seq, grad_seq)
            optimizer.step()

            print('Epoch [%d/%d], Iter [%d/%d] Losses:  total %.4f'
                      % (epoch + 1, num_epochs, i + 1, len(sh_dataset) // batch_size, loss_seq[0].data))
            nan_flag = False
            for name, param in model.named_parameters():
                if np.any(np.isnan(param.cpu().detach().numpy())):
                    print(name)
                    print('cnt:', np.sum(np.isnan(param.cpu().detach().numpy())))
                    nan_flag = True
            if nan_flag:
                raise Exception('nan')
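For readers following the loss computation, the regression term in the loop above turns each head's logits into a softmax distribution, takes the expected bin index, and rescales it to roughly [-1, 1) before comparing it with the continuous label. The sketch below reproduces that arithmetic in plain Python. It assumes idx_tensor holds the bin indices 0 .. num_bins - 1, which the post does not show, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(logits, num_bins):
    """Expected bin index under softmax, rescaled as in the training loop.

    Assumes the bin indices are 0 .. num_bins - 1 (hypothetical stand-in
    for idx_tensor in the post).
    """
    probs = softmax(logits)
    expected_bin = sum(p * k for k, p in enumerate(probs))
    return expected_bin * (2.0 / num_bins) - 1.0

num_bins = 50
# Uniform logits: the expected bin is the midpoint 24.5, which maps to -0.02.
uniform = expected_value([0.0] * num_bins, num_bins)
# One dominant logit in the last bin: the prediction approaches 49 * 0.04 - 1 = 0.96.
peaked = expected_value([0.0] * (num_bins - 1) + [100.0], num_bins)
```

Under these assumptions the prediction can never exceed 1 - 2/num_bins, so continuous labels outside that range always leave some regression error.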

bhushans23 · January 9, 2019, 4:51am

You can try gradient clipping. Here’s the reference:
https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html

Hui_Lin · January 9, 2019, 5:13am

Thanks for the reply!
I added the line below after backward(), but it still doesn’t work.

torch.nn.utils.clip_grad_norm(model.parameters(),10)
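One reason clipping may not help here: norm-based clipping bounds gradients that are large but finite, and it cannot repair a gradient that already contains NaN. The plain-Python sketch below illustrates the idea with a hypothetical helper, clip_by_global_norm. It is not PyTorch's implementation, only the same general rescale-by-norm technique. Separately, since PyTorch 0.4 the in-place function is named clip_grad_norm_, and the name without the trailing underscore is deprecated.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm.

    Hypothetical helper for illustration only.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = max_norm / (total_norm + eps)
    # Comparisons with NaN are False, so a NaN norm leaves the values untouched.
    if coef < 1.0:
        grads = [g * coef for g in grads]
    return grads, total_norm

# A large finite gradient is scaled down to norm ~10.
finite, finite_norm = clip_by_global_norm([30.0, 40.0], max_norm=10.0)

# A gradient containing NaN has a NaN norm, and the NaN entry survives clipping.
broken, broken_norm = clip_by_global_norm([30.0, float("nan")], max_norm=10.0)
```

If the gradients are already non-finite when clipping runs, checking param.grad for NaN or inf right after backward() is likely to be more informative than checking the parameters after optimizer.step().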

Hui_Lin · January 9, 2019, 5:19am

I just tried a standard ResNet (for 27 classes), computing only the regression loss (MSE) and training on the same dataset. It does not produce NaN.

Hui_Lin · January 9, 2019, 8:29am

Summary

I solved this problem by reducing num_coefs from 27 to 3. This parameter is the number of fully connected layers I added to ResNet. Each layer has 50 classes, and I calculate the loss of each FC layer separately. Each loss combines MSE and cross-entropy.
But can anyone explain why 27 FC layers produce NaN and 3 do not?
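One possible contributing factor, offered as a hypothesis rather than a diagnosis: calling torch.autograd.backward() on a list of scalar losses with all-ones gradients accumulates the same parameter gradients as backpropagating their sum. The shared ResNet backbone therefore receives a gradient contribution from every head, and with 27 heads instead of 3 that summed gradient can be several times larger. The toy example below uses one shared scalar weight and identical squared-error heads (all hypothetical) to show the scaling.

```python
def shared_weight_grad(w, x, targets):
    """Gradient w.r.t. a shared weight w of sum_j (w * x - t_j) ** 2.

    Each target stands in for one head's loss; every head reads the same
    shared feature w * x (a toy stand-in for the backbone).
    """
    return sum(2.0 * (w * x - t) * x for t in targets)

w, x, target = 0.5, 2.0, 3.0

grad_3 = shared_weight_grad(w, x, [target] * 3)    # 3 heads
grad_27 = shared_weight_grad(w, x, [target] * 27)  # 27 heads

# With identical heads the summed gradient grows linearly with the head count.
ratio = grad_27 / grad_3

# Averaging over heads brings the gradient back to the per-head magnitude.
mean_27 = grad_27 / 27
```

If this is what is happening, dividing the summed loss by num_coefs would keep the backbone update comparable to the 3-head setup. The original post reports that lowering the learning rate alone did not remove the NaN, so this may not be the whole story.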