Ce_loss reduction='mean'

Hi,
I’m trying to track down NaN values that appear during my training steps. To do so, I used torch.autograd.detect_anomaly() and caught the errors so that I could save my variables and state_dict and pin down the issue more precisely.

I was able to narrow the problem down to the CrossEntropyLoss calculation. More specifically, when I compute the loss with my logits and target tensors, I get:

target = torch.load('../target.pt', map_location='cuda:2')
logits = torch.load('../logits.pt', map_location='cuda:2')

# convert the one-hot target (N, C, ...) to class indices (N, ...), as expected by cross_entropy
_, target = target.max(dim = 1)
# computing the loss with default 'mean' reduction
loss = F.cross_entropy(logits, target)
loss
tensor(nan, device='cuda:2', dtype=torch.float16, grad_fn=<NllLoss2DBackward0>)

However, when I perform the calculation without the reduction, I get the following:

loss = F.cross_entropy(logits, target, reduction = 'none')
print(f"{loss.shape =}")
print(f"{loss.isnan().any() =}")
print(f"{loss.isinf().any() =}")
loss.shape =torch.Size([16, 64, 72, 64])
loss.isnan().any() =tensor(False, device='cuda:2')
loss.isinf().any() =tensor(False, device='cuda:2')

where no NaN values exist in the output. This led me to believe that the NaN appears during the reduction operation. However, according to the PyTorch documentation, this would only be possible if all the weights were 0, which I don’t understand because I left the weight parameter as None.

From my understanding, the reduction is first performed channel-wise, where the weighted losses for each channel are divided by the sum of the weights (at least for the mean), and then spatially over the image dimensions.

Could you describe in more detail how the reduction operation is performed? Could you point out whether I missed something or am doing something wrong?
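
For reference, here is a minimal sketch of how the 'mean' reduction relates to the unreduced losses, based on the formula in the F.cross_entropy documentation. It suggests the reduction is a single sum over all elements divided by the sum of the weights of the target classes, rather than a channel-wise step followed by a spatial one. The small shapes and the weight vector w are made up for illustration, and everything runs in float32 on the CPU.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes = 5
logits = torch.randn(2, n_classes, 4, 4, 4)      # (N, C, d1, d2, d3), float32
target = torch.randint(n_classes, (2, 4, 4, 4))  # class indices, int64

# weight=None: 'mean' is the plain average of the per-element losses
per_elem = F.cross_entropy(logits, target, reduction='none')
mean_builtin = F.cross_entropy(logits, target)
mean_manual = per_elem.sum() / per_elem.numel()

# With class weights (hypothetical values): the unreduced losses are already
# scaled by w[target], and the sum is divided by the sum of those weights
w = torch.rand(n_classes) + 0.5
per_elem_w = F.cross_entropy(logits, target, weight=w, reduction='none')
weighted_builtin = F.cross_entropy(logits, target, weight=w)
weighted_manual = per_elem_w.sum() / w[target].sum()

print(mean_builtin.item(), mean_manual.item())
print(weighted_builtin.item(), weighted_manual.item())
```

With weight=None the denominator is just the number of elements, so the "all weights are 0" case from the documentation cannot be what produces the NaN here.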

I also have a question regarding the required input dimensions: I don’t quite understand why the implementation does not accept a one-hot encoded tensor, as I believe it would be faster. If you could point out how the computation is performed, I would be grateful (the source code only pointed me towards C code, which I’m not really familiar with).

Thanks for your help,

PS:
More details on the tensors

print(f'{logits.max().item() =}, {logits.min().item() =}')
print(f'{logits.shape =}, {target.shape =}')
print(f'{logits.isnan().any(), logits.isinf().any()}')
print(f'{target.isnan().any(), target.isinf().any()}')
print(f'{logits.dtype}')
print(f'{target.dtype}')

logits.max().item() =382.0, logits.min().item() =-330.75
logits.shape =torch.Size([16, 135, 64, 72, 64]), target.shape =torch.Size([16, 64, 72, 64])
(tensor(False, device='cuda:2'), tensor(False, device='cuda:2'))
(tensor(False, device='cuda:2'), tensor(False, device='cuda:2'))
torch.float16
torch.int64

The tensors are in float16 because I’m training with autocast.

pip show torch
Version : torch2.0.1+cu118

Could you post a minimal and executable code snippet reproducing these NaN values, please?

Hi,
You can reproduce the problem pretty easily by generating random tensors with:

import torch
import torch.nn.functional as F

nb_batchs = 16
nb_classes = 135
shape = (64,72,64)

# send to GPU here (the internal softmax on CPU raises NotImplementedError)
x = torch.randn((nb_batchs, nb_classes,) + shape, dtype = torch.float16).to('cuda')
y = torch.randint(nb_classes, (nb_batchs,) + shape , dtype = torch.int64).to('cuda')

loss = F.cross_entropy(x, y)
loss.isnan()
tensor(True, device='cuda')

In case the randomness doesn’t go our way (but I get it every time), you can do:

for i in range(10):
    x = torch.randn((nb_batchs, nb_classes,) + shape, dtype = torch.float16).to('cuda')
    y = torch.randint(nb_classes, (nb_batchs,) + shape , dtype = torch.int64).to('cuda')

    loss = F.cross_entropy(x, y)
    print(loss.isnan())

(Note that I’m quite confused that I get this every time with random tensors, as my issue did not appear immediately during training but rather later on, as shown below.)

[Epoch 67]	Current learning rate: 9.95e-3
[2/140]	[DiceLoss: 0.1040][CrossEntropyLoss: 0.1506][Total: 0.2546]	1.67+4.67s
[4/140]	[DiceLoss: 0.1119][CrossEntropyLoss: 0.1450][Total: 0.2569]	1.63+0.37s
[6/140]	[DiceLoss: 0.1105][CrossEntropyLoss: 0.1399][Total: 0.2504]	1.63+0.54s

# Here CrossEntropy starts increasing
[38/140]	[DiceLoss: 0.1090][CrossEntropyLoss: 0.1474][Total: 0.2564]	1.64+0.36s
[40/140]	[DiceLoss: 0.1094][CrossEntropyLoss: 0.1543][Total: 0.2637]	1.64+0.36s
[42/140]	[DiceLoss: 0.1094][CrossEntropyLoss: 0.1586][Total: 0.2681]	1.64+0.36s

# We have higher value here, and then NaNs
[60/140]	[DiceLoss: 0.1188][CrossEntropyLoss: 0.2790][Total: 0.3978]	1.64+0.38s
[62/140]	[DiceLoss: 0.1193][CrossEntropyLoss: 0.2786][Total: 0.3978]	1.65+0.37s
[64/140]	[DiceLoss: 0.1219][CrossEntropyLoss: 0.3120][Total: 0.4339]	1.65+0.38s
[66/140]	[DiceLoss: nan][CrossEntropyLoss: nan][Total: nan]	1.64+0.36s
[68/140]	[DiceLoss: nan][CrossEntropyLoss: nan][Total: nan]	1.65+0.37s
[70/140]	[DiceLoss: nan][CrossEntropyLoss: nan][Total: nan]	1.65+0.41s

(I removed the output for some mini-batches for readability.)

In your previous post you claimed to use autocast, while you are now casting the tensors manually.
Use autocast as seen in e.g. this example and the code will work.

import torch
import torch.nn.functional as F

nb_batchs = 16
nb_classes = 135
shape = (64,72,64)

# send to GPU here (the internal softmax on CPU raises NotImplementedError)
x = torch.randn((nb_batchs, nb_classes,) + shape, dtype = torch.float16).to('cuda')
y = torch.randint(nb_classes, (nb_batchs,) + shape , dtype = torch.int64).to('cuda')

with torch.cuda.amp.autocast():
    loss = F.cross_entropy(x, y)
print(loss.isnan())
# tensor(False, device='cuda:0')

Hi,

First, thank you for your help. Could you explain in more detail why computing the cross entropy with half (float16) tensors without autocast results in NaN values? After further testing, I noticed that this only happens when using reduction='mean' (i.e. no NaNs are present in the unreduced output tensor).
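
One plausible mechanism, which the thread does not confirm and which I have not checked against the CUDA kernel, is float16 overflow during the reduction. The largest finite float16 value is 65504, while a batch of this shape has 16 * 64 * 72 * 64 = 4,718,592 loss terms, so neither the element count nor the sum of the losses fits in float16. The sketch below only illustrates the number ranges on the CPU. The average per-element loss of 5.0 is a made-up value, roughly log(135) for random logits.

```python
import torch

n_elements = 16 * 64 * 72 * 64              # 4,718,592 loss terms per batch
fp16_max = torch.finfo(torch.float16).max   # 65504.0

# The element count alone is not representable in float16
count_fp16 = torch.tensor(float(n_elements)).to(torch.float16)

# Sum of the losses, assuming a hypothetical average per-element loss of 5.0
total_fp16 = torch.tensor(5.0 * n_elements).to(torch.float16)

# Both values have overflowed to inf, and inf / inf is NaN
# (upcast first so the division itself is done in float32)
mean = total_fp16.float() / count_fp16.float()

print(n_elements > fp16_max, count_fp16, total_fp16, mean)
```

If this is indeed what happens, it would also be consistent with reduction='none' staying finite, since each individual loss value is small enough for float16.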

Also, would you be able to expand on what happens when using the autocast context manager?

In the example that you provided, if you print the dtype of the output tensor, you obtain:

torch.float32

after providing a torch.float16 tensor to the cross entropy operation.

It was my understanding (from this paper: https://arxiv.org/pdf/1710.03740.pdf) that autocast would accumulate float16 ops (for supported operations) to emulate the precision of float32 ops, and then use float16 for all read/write ops (except for the model weights) for faster memory access.
How does PyTorch choose the storage dtype of each tensor?
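
As far as I can tell from the autocast documentation, the dtype is chosen per op rather than per tensor: some CUDA ops are listed as running in float16, others (log_softmax and nll_loss among them) as running in float32, in which case the inputs are cast to float32 first. That would explain the torch.float32 output above. The sketch below is only a manual CPU stand-in for that behavior, assuming cross_entropy is handled like those listed ops. It does not call autocast itself, and the small shapes are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes = 5
# Stand-in for half-precision logits coming out of a model
x16 = torch.randn(2, n_classes, 4, 4, 4).to(torch.float16)
y = torch.randint(n_classes, (2, 4, 4, 4))

# Manual equivalent of a float32-listed op under autocast:
# upcast the float16 input, then compute and reduce in float32
loss = F.cross_entropy(x16.float(), y)

print(x16.dtype, loss.dtype, loss.isnan().item())
```

Outside of autocast, the same explicit .float() on the logits before the loss should be a reasonable workaround when the model output is float16.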

I also have an unrelated question about the implementation of the cross entropy. From my understanding, computing on categorical labels usually incurs an overhead. However, this does not seem to be a problem for the PyTorch implementation (providing a (N, C, w, h, d) target tensor is reserved for computing with soft probabilities). Would you be able to explain how PyTorch achieves this?
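
On the class-index question, here is a rough sketch of why index targets need not be slower than one-hot targets. With indices, the loss for each voxel is a single lookup into the log-softmax output, whereas the one-hot form multiplies by a full (N, C, ...) mask and sums over the class dimension. This is written with plain tensor ops to show the arithmetic and is not a description of the actual C++/CUDA kernels. Shapes are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes = 5
logits = torch.randn(2, n_classes, 4, 4, 4)      # (N, C, d1, d2, d3)
target = torch.randint(n_classes, (2, 4, 4, 4))  # class indices, (N, d1, d2, d3)

log_probs = F.log_softmax(logits, dim=1)

# Index path: pick one log-probability per voxel
loss_idx = -log_probs.gather(1, target.unsqueeze(1)).squeeze(1)

# One-hot path: build a (N, C, ...) mask, multiply, and sum over the classes
one_hot = F.one_hot(target, n_classes).movedim(-1, 1).to(log_probs.dtype)
loss_onehot = -(one_hot * log_probs).sum(dim=1)

# Built-in unreduced loss for comparison
reference = F.cross_entropy(logits, target, reduction='none')

print(torch.allclose(loss_idx, reference), torch.allclose(loss_onehot, reference))
```

The one-hot tensor is also C times larger than the index tensor (135 times for the shapes in this thread), which matters for memory at these volume sizes.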