I am facing the same trouble as the original author posted. I multiply the results of two softmax outputs (softmax over two different dimensions), then sum the tensor over one dimension to get the final output scores, say a 20-d tensor. Below is the output score that triggers the CUDA AssertionError, specifically the value 1.0000e+00, which in theory should not happen.
I assume this is related to floating-point precision error. The error is not reliably reproducible: I hit it sometimes around 3k steps and sometimes only after 10k steps of training.
Does this imply that we should clamp the tensor whenever we feed it to binary_cross_entropy (see the sketch after the code below)? It might also be a good idea to log which value actually causes the AssertionError.
tensor([9.4490e-05, 1.3122e-06, 1.9130e-03, 1.1611e-04, 3.1499e-05, 7.9529e-05,
5.0480e-05, 1.0000e+00, 2.0515e-04, 1.4706e-06, 3.1726e-05, 1.7213e-09,
8.1568e-05, 6.2557e-06, 1.4758e-06, 2.2086e-04, 1.9921e-04, 7.1404e-05,
6.8685e-06, 1.0655e-04], device='cuda:0', grad_fn=<SumBackward1>)
import torch
import torch.nn.functional as F

cls_prob = F.softmax(cls_score, dim=1)         # across classes               [2000, 20]
det_prob = F.softmax(det_score, dim=0)         # across proposals/detections  [2000, 20]
predict = torch.mul(cls_prob, det_prob)        # element-wise product         [2000, 20]
pred_class_scores = torch.sum(predict, dim=0)  # per-class image scores       [20]
loss = F.binary_cross_entropy(pred_class_scores, label, size_average=False)
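For reference, this is the kind of workaround I have in mind. It is only a minimal sketch against the snippet above (the eps margin is an arbitrary choice of mine, not something from the library), and it papers over the symptom rather than the root cause:

# Log any score outside [0, 1] to see what actually trips the assertion.
bad = (pred_class_scores < 0) | (pred_class_scores > 1)
if bad.any():
    print("out-of-range BCE inputs:", pred_class_scores[bad])

# Clamp into [0, 1 - eps]; the small margin also avoids log(0) when the label is 0.
eps = 1e-6
pred_class_scores = torch.clamp(pred_class_scores, min=0.0, max=1.0 - eps)
loss = F.binary_cross_entropy(pred_class_scores, label, size_average=False)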