RuntimeError: CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect

Hello. The following code snippet raises a runtime error.

Relevant error:
  File "run_dst.py", line 863, in <module>
    main()
  File "run_dst.py", line 848, in main
    result = evaluate(args, model, tokenizer, processor, prefix=global_step)
  File "run_dst.py", line 296, in evaluate
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/My Drive/graph_transformer/modeling_bert_dst.py", line 430, in forward
    start_loss = token_loss_fct(start_logits, start_pos[slot])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1152, in forward
    label_smoothing=self.label_smoothing)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

As can be seen in the relevant code snippet, the original author of the code tried to handle this error.
I started getting the error after changing other parts of the code that are unrelated to this part, and I do not know the reason.
Could it be caused by the PyTorch version?
The author used version 1.4.0 and I am using version 1.10.0.

You should run the code on the CPU to see what the actual error is.
My guess is that it's caused by the target dtype.
You can try calling .long().cuda() on all of the cross-entropy loss targets.
By the way, you're creating your loss functions inside the training loop.
It's better to move them out of the loop.
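As a rough illustration of the suggestions above, the sketch below creates the loss once, casts the target to int64, and recomputes the loss on the CPU, where the error is raised synchronously. The tensors, shapes, and the helper name `loss_on_cpu` are hypothetical and not taken from run_dst.py. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 before launching the script is another common way to make the GPU stack trace point at the failing call.

```python
import torch
import torch.nn as nn

# Create the loss function once, outside the training loop.
loss_fct = nn.CrossEntropyLoss()


def loss_on_cpu(logits, target):
    """Recompute the loss on the CPU with an int64 target (hypothetical helper)."""
    return loss_fct(logits.detach().float().cpu(), target.detach().long().cpu())


# Hypothetical batch: 4 samples, 3 classes.
logits = torch.randn(4, 3)
good_target = torch.tensor([0, 2, 1, 2], dtype=torch.int32)  # wrong dtype, valid values
bad_target = torch.tensor([0, 3, 1, 2])                      # 3 is not a valid class here

# Valid class indices: the cast to long is enough.
ok_loss = loss_on_cpu(logits, good_target)

# Invalid class index: the CPU raises immediately at the loss call.
try:
    loss_on_cpu(logits, bad_target)
    cpu_error = None
except (IndexError, RuntimeError) as exc:
    cpu_error = str(exc)

print("loss with valid targets:", ok_loss.item())
print("error with invalid targets:", cpu_error)
```

If the CPU run still fails after the cast to long, the dtype was not the problem and the target values themselves are worth inspecting.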

I made that correction, but it still gives the same error.

Your target values are out of bounds. nn.CrossEntropyLoss expects the target to contain class indices in the range [0, nb_classes-1], so check the min/max values of the target and make sure they contain only valid values.
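A minimal sketch of that check, assuming logits of shape (batch, nb_classes) and the default ignore_index of -100. The function name `check_targets` is made up for illustration and is not part of PyTorch.

```python
import torch


def check_targets(logits, target, ignore_index=-100):
    """Raise ValueError if target holds class indices outside [0, nb_classes - 1].

    Assumes logits has shape (batch, nb_classes); entries equal to
    ignore_index are skipped, as nn.CrossEntropyLoss does.
    """
    nb_classes = logits.size(1)
    valid = target[target != ignore_index]
    if valid.numel() == 0:
        return
    lo, hi = int(valid.min()), int(valid.max())
    if lo < 0 or hi >= nb_classes:
        raise ValueError(
            f"target range [{lo}, {hi}] is outside [0, {nb_classes - 1}]"
        )


# Hypothetical example: 3 classes, so valid indices are 0, 1, 2.
logits = torch.zeros(2, 3)
check_targets(logits, torch.tensor([0, 2]))      # passes silently
check_targets(logits, torch.tensor([-100, 1]))   # ignored entry is skipped
```

Calling such a check right before the loss, for example on start_logits and start_pos[slot], turns the opaque device-side assert into a Python exception that reports the offending range.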


Yes, the number of class labels was not specified correctly.
I was confused just because the debugger did not pinpoint the error.
Thank you.