student_result has shape (4, 7, 128, 128, 128) and label has shape (4, 1, 128, 128, 128); both are on the GPU.
student_result comes from the model's final layer, which is a single Conv3d() that changes the channel count to 7.
I have tried applying torch.softmax(student_result, dim=1) to fix the error, but it did not help.
When I move both student_result and label to the CPU and comment out the with autocast() block, everything works fine. So what should I do to make this code run correctly on the GPU in fp16 mode?
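Side note: from what I have read, a device-side assert inside a loss kernel often means a class index is out of range. A minimal sanity check, assuming my 7 output channels correspond to classes 0-6 (so any label outside that range would be invalid), would be something like:

import torch

# Hypothetical one-off check, run on CPU before training:
# rule out labels outside the assumed valid range 0..6.
for batch in DataLoader:
    hard_target = batch['hard_target'].to(torch.long)
    lo, hi = int(hard_target.min()), int(hard_target.max())
    assert 0 <= lo and hi <= 6, f"label out of range: min={lo}, max={hi}"

Here is my training loop: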
amp_grad_scaler = GradScaler()
for epoch in range(begin_epoch, end_epoch):
    student_module.train()
    for i, batch in enumerate(DataLoader):
        data = batch['data'].to(torch.float32).cuda()
        hard_target = batch['hard_target'].to(torch.long).cuda()
        with torch.no_grad():
            teacher_result = teacher_module(data)
        with autocast():
            student_result = student_module(data)
            loss = LossFunction(teacher_result, student_result, hard_target, Temperature, 0.7)
        amp_grad_scaler.scale(loss).backward()
        amp_grad_scaler.step(optimizer)
        amp_grad_scaler.update()
        sum_loss += loss
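I have also seen the suggestion to keep the forward pass under autocast but compute the loss itself in float32, so the softmax and KL-divergence inside the loss do not overflow or underflow in fp16. A sketch of that pattern (not verified on this model; LossFunction is the function from my traceback):

from torch.cuda.amp import autocast

with autocast():
    student_result = student_module(data)
# Leave the autocast region so everything inside LossFunction runs in float32.
with autocast(enabled=False):
    loss = LossFunction(teacher_result.float(), student_result.float(),
                        hard_target, Temperature, 0.7)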
Error message:
Traceback (most recent call last):
  File "/home/XXX/Code_Wrap/Distilling_NNUNET-main/main.py", line 133, in main
    loss=LossFunction(teacher_result,student_result,hard_target,Temperature,0.7)
  File "/home/XXX/Code_Wrap/Distilling_NNUNET-main/main.py", line 53, in LossFunction
    loss1=weight*loss_function(torch.softmax(student_result/T,dim=1),torch.softmax(softtarget/T,dim=1),)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Although this traceback points at loss1, the "unable to get repr" error actually occurred in loss2, which I pasted above.
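As the message suggests, I will try rerunning with CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and the stack trace points at the real failure; setting the variable at the top of main.py, before torch is imported, should be enough:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous CUDA kernel launches
import torch  # must be imported after the environment variable is set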