Problem now: I disabled the cross-entropy loss and used only MSE, and that fixed the previous problem.
So the cross-entropy loss is most likely the root of the previous problem.
But can anyone explain why passing the list of cross-entropy losses to autograd.backward() causes NaN?
Thanks!
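As far as I understand, passing a list of losses with all-ones gradients to torch.autograd.backward() should accumulate the same gradients as calling .backward() on their sum, so the list form by itself is probably not what produces the NaN. A minimal sketch to check this on a toy model (the layer sizes, seed, and the helper per_head_losses are made up for illustration and are not part of my training code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 4 samples, 3 classification heads, 5 bins per head.
num_heads, num_bins = 3, 5
model = nn.Linear(8, num_heads * num_bins)
x = torch.randn(4, 8)
labels = torch.randint(0, num_bins, (4, num_heads))
criterion = nn.CrossEntropyLoss()


def per_head_losses():
    """Run a forward pass and return one cross-entropy loss per head."""
    logits = model(x).view(4, num_heads, num_bins)
    return [criterion(logits[:, j], labels[:, j]) for j in range(num_heads)]


# Variant 1: list of losses plus a matching list of all-ones gradients.
model.zero_grad()
loss_seq = per_head_losses()
grad_seq = [torch.ones_like(loss) for loss in loss_seq]
torch.autograd.backward(loss_seq, grad_seq)
grad_from_list = model.weight.grad.clone()

# Variant 2: one backward pass on the summed loss.
model.zero_grad()
torch.stack(per_head_losses()).sum().backward()
grad_from_sum = model.weight.grad.clone()

same_gradients = torch.allclose(grad_from_list, grad_from_sum, atol=1e-6)
print("same gradients:", same_gradients)
```

If the two variants agree, the question becomes which individual loss term (or which batch) produces a non-finite gradient, not how the losses are passed to backward().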
Previous problem (solved):
I tried lowering the learning rate, but I still got NaN. I found that the first place NaN appeared was layer2.0.conv1.weights (or sometimes layer4.2.bn3.bias) in the ResNet, and only the last element of the weights was NaN. The input and loss of that iteration were not NaN, but after that iteration all weights and losses became NaN.
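Since the loss of that iteration is still finite, the NaN presumably enters through the gradients, so it may help to inspect param.grad between backward() and optimizer.step() instead of only checking the weights afterwards. A rough sketch, where find_nonfinite_grads is a hypothetical helper (my own name, not a PyTorch function) and the toy layer only exists to provoke a NaN gradient from a finite loss:

```python
import torch
import torch.nn as nn


def find_nonfinite_grads(model):
    """Return (name, count) for every parameter whose gradient contains NaN or inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        count = int((~torch.isfinite(param.grad)).sum())
        if count > 0:
            bad.append((name, count))
    return bad


# Toy demonstration: the loss value is a finite 0, but the derivative of sqrt
# at 0 is infinite, and the chain rule then produces 0 / 0 = NaN.
layer = nn.Linear(2, 1, bias=False)
with torch.no_grad():
    layer.weight.zero_()          # forces the layer output to be exactly 0

x = torch.ones(1, 2)
loss = (torch.sqrt(layer(x)) * 0).sum()
loss.backward()

bad = find_nonfinite_grads(layer)
print("loss:", loss.item(), "non-finite grads:", bad)
```

In a training loop this check would go right after the backward call, so the offending batch can be inspected (or the step skipped) before the weights are overwritten.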
My training code (posted at the end) is based on: GitHub - natanielruiz/deep-head-pose: 🔥🔥 Deep Learning Head Pose Estimation using PyTorch.
He added 3 fully connected layers on top of ResNet, and I use 27. How can I fix this problem? Thanks, guys.
"""training code"""
import numpy as np
import torch
from torch.autograd import Variable

# model, train_loader, criterion, reg_criterion, softmax, idx_tensor, optimizer,
# alpha, num_bins, num_coefs, gpu, etc. are defined elsewhere in the script.
for epoch in range(num_epochs):
    for i, (images, labels, cont_labels, pic_names) in enumerate(train_loader):
        images = Variable(images).cuda(gpu)
        label_sh = Variable(labels).cuda(gpu)  # Binned labels
        label_sh_cont = Variable(cont_labels).cuda(gpu)  # Continuous labels
        sh = model(images)  # prediction
        loss_seq = []
        for j in range(num_coefs):
            # **Cross entropy loss**
            loss_sh = criterion(sh[:, j], label_sh[:, j])
            # MSE loss on the expected bin index, rescaled to a continuous value
            sh_predicted = softmax(sh[:, j])
            sh_predicted = torch.sum(sh_predicted * idx_tensor, 1) * (2. / num_bins) - 1
            loss_reg_sh = reg_criterion(sh_predicted, label_sh_cont[:, j])
            # Total loss
            loss_sh += alpha * loss_reg_sh
            loss_seq.append(loss_sh)
        grad_seq = [torch.ones(1).cuda(gpu) for _ in range(len(loss_seq))]
        optimizer.zero_grad()
        torch.autograd.backward(loss_seq, grad_seq)
        optimizer.step()
        print('Epoch [%d/%d], Iter [%d/%d] Losses: total %.4f'
              % (epoch + 1, num_epochs, i + 1, len(sh_dataset) // batch_size, loss_seq[0].data))
        # Stop as soon as any parameter contains NaN and report where
        nan_flag = False
        for name, param in model.named_parameters():
            if np.any(np.isnan(param.cpu().detach().numpy())):
                print(name)
                print('cnt:', np.sum(np.isnan(param.cpu().detach().numpy())))
                nan_flag = True
        if nan_flag:
            raise Exception('nan')
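One thing that might be worth trying, given that 27 losses are summed here instead of 3: average the per-head losses so the gradient scale does not grow with the number of heads, and clip the global gradient norm before the optimizer step. This is only a sketch on a toy linear model; num_bins, the layer sizes, and max_norm are arbitrary values chosen for illustration, and I have not confirmed that this removes the NaN in the real network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 27 heads as in my model; num_bins and the input size are made-up values.
num_heads, num_bins = 27, 10
model = nn.Linear(16, num_heads * num_bins)
x = torch.randn(8, 16)
labels = torch.randint(0, num_bins, (8, num_heads))
criterion = nn.CrossEntropyLoss()

logits = model(x).view(8, num_heads, num_bins)
loss_seq = [criterion(logits[:, j], labels[:, j]) for j in range(num_heads)]

# Average instead of summing, so adding heads does not scale up the gradients.
total_loss = torch.stack(loss_seq).mean()

model.zero_grad()
total_loss.backward()

# Rescale all gradients so that their global L2 norm is at most max_norm.
max_norm = 0.1
norm_before = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))
norm_after = float(torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters())))
print("grad norm before / after clipping:", norm_before, norm_after)
```

In the loop above, the clipping call would sit between the backward call and optimizer.step().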