Update: I disabled the cross-entropy loss and trained with only the MSE loss, and that fixed the previous problem.
So the cross-entropy loss is most likely the root of the previous problem.
But can anyone explain why passing the list of cross-entropy losses to torch.autograd.backward() causes NaN?
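
One way to look for the operation that first produces NaN during backward might be autograd's anomaly detection. This is only a minimal sketch on a toy example (not the model from this post), assuming a PyTorch version that provides torch.autograd.detect_anomaly. The sqrt expression is a stand-in chosen because its loss is finite while its gradient is not:

```python
import torch

# Toy example (not the model from this post): the loss is finite, but its gradient is NaN.
x = torch.tensor([0.0], requires_grad=True)
loss = (torch.sqrt(x) * 0.0).sum()  # loss is 0.0; the gradient of sqrt at 0 is not finite

raised = False
try:
    # In anomaly mode, backward() raises as soon as a backward function returns NaN
    with torch.autograd.detect_anomaly():
        loss.backward()
except RuntimeError as err:
    raised = True
    print('anomaly detected:', err)
```

In the real training loop the same context manager could be placed around the torch.autograd.backward(...) call. It slows training down, so it is probably only worth enabling while debugging.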
I tried lowering the learning rate, but I still get NaN. I found that the first place where NaN appeared was layer2.0.conv1.weights (or sometimes layer4.2.bn3.bias) in the ResNet, and only the last element of the weights was NaN; the input and the loss of that iteration were not NaN.
But after that iteration, all weights and losses become NaN.
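
Since the input and the loss are still finite in the iteration where the first NaN weight shows up, it may help to inspect the gradients between backward() and optimizer.step() rather than the weights afterwards. A minimal sketch, assuming a PyTorch version with torch.isfinite; find_nonfinite_grads is a hypothetical helper name and the nn.Linear model is only a toy stand-in for the ResNet:

```python
import torch
import torch.nn as nn


def find_nonfinite_grads(model):
    """Return (name, count) for every parameter whose gradient contains NaN or inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is None:
            continue  # parameter did not take part in the backward pass
        n_bad = int((~torch.isfinite(param.grad)).sum())
        if n_bad > 0:
            bad.append((name, n_bad))
    return bad


# Toy usage: a healthy backward pass reports nothing
torch.manual_seed(0)
toy = nn.Linear(3, 2)
toy(torch.ones(1, 3)).sum().backward()
clean = find_nonfinite_grads(toy)

# Poison one gradient entry by hand to show what the helper reports
toy.weight.grad[0, 0] = float('nan')
poisoned = find_nonfinite_grads(toy)
print(clean, poisoned)
```

Called right after the backward call and before optimizer.step(), this would show which layer receives the first bad gradient, and the step could be skipped or the batch dumped for inspection.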
The training code below is based on: https://github.com/natanielruiz/deep-head-pose
He added 3 fully connected layers on top of ResNet, and I used 27.
How can I fix this problem? Thanks, guys.

"""training code"""
import numpy as np
import torch
from torch.autograd import Variable

# model, criterion, reg_criterion, softmax, optimizer, idx_tensor, train_loader,
# sh_dataset and the hyperparameters (alpha, num_bins, num_coefs, ...) are defined elsewhere
for epoch in range(num_epochs):
    for i, (images, labels, cont_labels, pic_names) in enumerate(train_loader):
        images = Variable(images).cuda(gpu)
        label_sh = Variable(labels).cuda(gpu)            # Binned labels
        label_sh_cont = Variable(cont_labels).cuda(gpu)  # Continuous labels

        sh = model(images)  # prediction

        loss_seq = []
        for j in range(num_coefs):
            # **Cross entropy loss**
            loss_sh = criterion(sh[:, j], label_sh[:, j])

            # MSE loss
            sh_predicted = softmax(sh[:, j])
            sh_predicted = torch.sum(sh_predicted * idx_tensor, 1) * (2. / num_bins) - 1
            loss_reg_sh = reg_criterion(sh_predicted, label_sh_cont[:, j])

            # Total loss
            loss_sh += alpha * loss_reg_sh
            loss_seq.append(loss_sh)

        grad_seq = [torch.ones(1).cuda(gpu) for _ in range(len(loss_seq))]
        optimizer.zero_grad()
        torch.autograd.backward(loss_seq, grad_seq)
        optimizer.step()

        # loss_seq is a list, so sum the individual losses for printing
        total_loss = sum(l.item() for l in loss_seq)
        print('Epoch [%d/%d], Iter [%d/%d] Losses: total %.4f'
              % (epoch + 1, num_epochs, i + 1, len(sh_dataset) // batch_size, total_loss))

        # Stop as soon as any parameter contains NaN and report where
        nan_flag = False
        for name, param in model.named_parameters():
            values = param.cpu().detach().numpy()
            if np.any(np.isnan(values)):
                print(name)
                print('cnt:', np.sum(np.isnan(values)))
                nan_flag = True
        if nan_flag:
            raise Exception('nan')