Hi,
I’m trying to troubleshoot NaN values that appear during my training steps. For this purpose I used torch.autograd.detect_anomaly() and caught the resulting errors in order to save my variables and the model’s state_dict, so I could pinpoint the issue more accurately.
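Roughly, the setup looks like this (a simplified sketch; compute_loss, batch, logits, target and model are placeholders for my actual training step):

import torch

with torch.autograd.detect_anomaly():
    try:
        loss = compute_loss(batch)  # placeholder for my forward pass + loss
        loss.backward()
    except RuntimeError:
        # dump the tensors involved so they can be reloaded for debugging
        torch.save(logits, 'logits.pt')
        torch.save(target, 'target.pt')
        torch.save(model.state_dict(), 'state_dict.pt')
        raise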
I was able to narrow the problem down to the calculation of the CrossEntropyLoss. More specifically, when I compute the loss with my logits & target tensors I get:
import torch
import torch.nn.functional as F

target = torch.load('../target.pt', map_location='cuda:2')
logits = torch.load('../logits.pt', map_location='cuda:2')
# convert the one-hot target to class indices, as cross_entropy expects
_, target = target.max(dim=1)
# compute the loss with the default 'mean' reduction
loss = F.cross_entropy(logits, target)
loss
tensor(nan, device='cuda:2', dtype=torch.float16, grad_fn=<NllLoss2DBackward0>)
However, when I perform the calculation without the reduction, I get the following:
loss = F.cross_entropy(logits, target, reduction='none')
print(f"{loss.shape =}")
print(f"{loss.isnan().any() =}")
print(f"{loss.isinf().any() =}")
loss.shape =torch.Size([16, 64, 72, 64])
loss.isnan().any() =tensor(False, device='cuda:2')
loss.isinf().any() =tensor(False, device='cuda:2')
where no NaN values exist in the output. This leads me to believe that the NaN appears during the reduction operation. However, according to the PyTorch documentation, that should only be possible if all the weights are 0, and I’m not sure I understand this because I left the weight parameter at its default of None.
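For reference, my reading of the docs formula for reduction='mean' is

\ell = \frac{\sum_{n=1}^{N} w_{y_n}\,\ell_n}{\sum_{n=1}^{N} w_{y_n}}, \qquad \ell_n = -\log\frac{\exp(x_{n,y_n})}{\sum_{c=1}^{C}\exp(x_{n,c})}

so with weight=None every w_{y_n} should be 1 and the denominator is just the number of elements, which is why the division-by-zero explanation doesn’t seem to apply here.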
From my understanding, the reduction is performed first channel-wise, where the weighted losses for each channel are divided by the sum of the weights (at least in the case of 'mean'), and then it is performed spatially over the image dimensions.
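For completeness, this is how I would compare the reduction in float16 vs float32 (a quick sketch reusing the tensors loaded above):

loss_none = F.cross_entropy(logits, target, reduction='none')
print(f"{loss_none.mean() =}")                        # mean taken over the unreduced float16 losses
print(f"{loss_none.float().mean() =}")                # same mean, after upcasting to float32
print(f"{F.cross_entropy(logits.float(), target) =}") # fused loss computed entirely in float32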
Could you describe in more detail how the reduction operation is performed? And could you point out whether I missed something or am doing something wrong?
I also have a question regarding the required input dimensions: I don’t quite understand why the implementation does not accept a one-hot encoded target tensor, as I believe it would be faster. If you could point out how the computation is performed, I would be grateful (the source code only pointed me towards C++ code, which I’m not really familiar with).
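To make the question concrete, this is my current mental model of the computation for my 5D inputs, written as a rough pure-PyTorch sketch (cross_entropy_reference is just a name I made up, not the actual C++ kernel):

def cross_entropy_reference(logits, target):
    # logits: (N, C, d1, d2, d3); target: (N, d1, d2, d3) holding class indices
    log_probs = F.log_softmax(logits, dim=1)         # numerically stable log-softmax over the class dim
    nll = -log_probs.gather(1, target.unsqueeze(1))  # pick the log-probability of the true class
    return nll.mean()                                # with weight=None, 'mean' is a plain mean over all elements

If that matches the real computation, I don’t see why the class-index format would be required over one-hot, since the gather could just as well be a sum against the one-hot mask.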
Thanks for your help,
PS:
More details on the tensors:
print(f'{logits.max().item() =}, {logits.min().item() =}')
print(f'{logits.shape =}, {target.shape =}')
print(f'{logits.isnan().any(), logits.isinf().any()}')
print(f'{target.isnan().any(), target.isinf().any()}')
print(f'{logits.dtype}')
print(f'{target.dtype}')
logits.max().item() =382.0, logits.min().item() =-330.75
logits.shape =torch.Size([16, 135, 64, 72, 64]), target.shape =torch.Size([16, 64, 72, 64])
(tensor(False, device='cuda:2'), tensor(False, device='cuda:2'))
(tensor(False, device='cuda:2'), tensor(False, device='cuda:2'))
torch.float16
torch.int64
The tensors are in float16 because I’m training with autocast.
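For context, the forward pass runs roughly like this (a simplified sketch; model and inputs are placeholders), which is where the float16 logits come from:

with torch.autocast(device_type='cuda', dtype=torch.float16):
    logits = model(inputs)  # conv/matmul outputs come out as float16 under autocast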
pip show torch
Version: 2.0.1+cu118