RuntimeError: CUDA error: device-side assert triggered

Hello,
I trained my model and everything went fine, but when testing it this error shows up:
RuntimeError: CUDA error: device-side assert triggered

Hi,

You can run your code with CUDA_LAUNCH_BLOCKING=1 python your_script.py to get a better error message.
Also, it should have printed more info about the assert in the terminal, no?

I am using a Jupyter notebook.

Make sure that you set the environment variable right after restarting the kernel and before running any CUDA-related code (before importing torch if possible).
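
In a notebook, a minimal sketch of this could look like the cell below, run first thing after restarting the kernel. The helper name `enable_blocking_launches` is made up for illustration, and the check on `sys.modules` is only a heuristic: what actually matters is that the variable is set before CUDA is initialized.

```python
import os
import sys


def enable_blocking_launches():
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns False if torch was already imported, in which case restarting
    the kernel is the safe option.
    (Hypothetical helper, not part of PyTorch.)
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


if not enable_blocking_launches():
    print("torch is already imported: restart the kernel and run this cell first")

# Import torch only after the variable is set:
# import torch
```

Setting the variable from Python like this should be equivalent to prefixing the command with CUDA_LAUNCH_BLOCKING=1 when launching a script from the terminal.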

I did. CUDA works fine in the training phase, and right after that, when the testing phase starts, the problem pops up.

Can you give the full log please?
And if you can get a small code repro (around 30 lines) that would be very helpful.

----> 2 images=images.to(device)
3 labels=labels.to(device)
RuntimeError:CUDA error: device-side assert triggered error

That is exactly the error I am getting,
and this is the code causing it:

total_loss = 0
total_correct = 0
for batch in testloader:
    images, labels = batch
    images = images.to(device)
    labels = labels.to(device)
    preds = net(images)
    loss = loss_function(preds, labels)
    total_loss += loss.item()
    total_correct += get_correct(preds, labels)

I would be very surprised if that line threw that. Most likely another line is responsible, and because CUDA calls are asynchronous, the error is being reported at the wrong line.
Are you sure that you set CUDA_LAUNCH_BLOCKING properly?

Sadly I am sure xD
I wasn’t able to figure out a solution because I can’t see how that line could cause such an error, especially since I am using the same line earlier in the code and it ran fine.

One thing to know is that once a GPU triggers an assert, it goes into a bad state and the whole process needs to be restarted. In your case, that means restarting the kernel from scratch. Otherwise, you will see random errors every time you try to use the GPU (this is a limitation on the CUDA side).

You should also double-check that everywhere you use indices (targets for the loss, or when indexing), these indices are in the proper range.
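
As a rough sketch of such a check, assuming a classification loss that takes class indices as targets (for example nn.CrossEntropyLoss, which expects values from 0 to C-1), something like the following could be called on each batch of labels. The helper `check_label_range` and the `num_classes` argument are made up for illustration; `num_classes` would be the number of outputs of your last layer.

```python
def check_label_range(labels, num_classes):
    """Raise ValueError if any class index falls outside [0, num_classes).

    `labels` can be a 1-D tensor (anything with .tolist()) or a plain
    iterable of ints. Hypothetical helper, not part of PyTorch.
    """
    values = labels.tolist() if hasattr(labels, "tolist") else list(labels)
    bad = sorted({v for v in values if not 0 <= v < num_classes})
    if bad:
        raise ValueError(f"labels out of range [0, {num_classes - 1}]: {bad}")


# Plain lists stand in for a batch of targets here:
check_label_range([0, 2, 1], num_classes=3)      # passes silently
try:
    check_label_range([0, 3, 1], num_classes=3)  # 3 is not a valid class
except ValueError as err:
    print(err)  # labels out of range [0, 2]: [3]
```

Calling it on the labels before they are moved to the device means a bad target raises a normal Python exception on the CPU instead of a device-side assert.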

If both of these are done, I’m afraid you will have to remove code piece by piece to pinpoint the reason for the failure.
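
One way to narrow things down, sketched here under some assumptions, is to run the evaluation step one batch at a time and report the first batch that fails. This is most informative when the model and data are kept on the CPU for the experiment, since errors there are raised synchronously at the offending line. `first_failing_batch` and `fake_step` are hypothetical names, and the fake step only stands in for your real forward pass and loss.

```python
def first_failing_batch(batches, step):
    """Run step(batch) on each batch in order.

    Returns (index, exception) for the first batch that raises,
    or None if every batch runs cleanly.
    """
    for index, batch in enumerate(batches):
        try:
            step(batch)
        except Exception as exc:
            return index, exc
    return None


# Stand-in for the real forward pass + loss: rejects labels >= 3.
def fake_step(batch):
    _, labels = batch
    if max(labels) >= 3:
        raise IndexError(f"target {max(labels)} is out of bounds")


# Each "batch" is (inputs, labels); the second one has a bad label.
batches = [([0.1], [0, 1]), ([0.2], [2, 3]), ([0.3], [1, 1])]
result = first_failing_batch(batches, fake_step)
print(result)
```

With a real model, `step` could be a small function that runs `net` and `loss_function` on one batch from `testloader`; once the failing batch is known, its labels and shapes can be inspected directly.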