I want to train a CNN model (ResNet50) in PyTorch with a dataset from Kaggle. I ran my code on a Kaggle GPU and received a CUDA error:
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
There are 15505 classes, and to load the data I use a custom data loader rather than a built-in one. This code previously worked with 10 classes on Colab with a PyTorch DataLoader.
Since you've changed the number of classes in your new workflow, I would guess your target might contain invalid indices.
Make sure the target contains class indices in [0, nb_classes-1] or rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing method in the stacktrace.
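The range check can be done before the targets ever reach the loss. Here is a minimal sketch, assuming the targets are available as an integer tensor; `check_targets` and the example values are hypothetical and not taken from your code:

```python
import torch

def check_targets(targets: torch.Tensor, nb_classes: int) -> bool:
    """Return True if every class index lies in [0, nb_classes - 1]."""
    lo, hi = int(targets.min()), int(targets.max())
    return 0 <= lo and hi < nb_classes

# Hypothetical example values.
nb_classes = 10
good = torch.tensor([0, 3, 9])
bad = torch.tensor([1, 5, 10])  # 10 is out of range for 10 classes

print(check_targets(good, nb_classes))
print(check_targets(bad, nb_classes))
```

Running a check like this over a few batches from the new data loader should show quickly whether the labels are the culprit.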
I have run this code many times with different numbers of classes (2, 10, and 870) and never received this error. This time I changed the data loader. Could that be the cause?
In addition, I did not understand the solution you gave me: "rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing method in the stacktrace."
If you cannot run the script in a terminal and need to run it in a notebook, execute it on the CPU to get a better stacktrace, which would then most likely point to an index error in nn.CrossEntropyLoss.
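To illustrate what the CPU run would look like, here is a small sketch that feeds an out-of-range target to nn.CrossEntropyLoss on the CPU. The shapes and values are made up for illustration, and the exact exception type and message may differ between PyTorch versions, so both likely types are caught:

```python
import torch
import torch.nn as nn

nb_classes = 10                        # hypothetical value
criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, nb_classes)    # fake model output, kept on the CPU
target = torch.tensor([0, 1, 2, 10])   # 10 is outside [0, nb_classes - 1]

try:
    criterion(logits, target)
    error_message = None
except (IndexError, RuntimeError) as exc:  # type depends on the PyTorch version
    error_message = str(exc)

# On the CPU the error is raised synchronously at the failing call.
print(error_message)
```

If you want to stay on the GPU in a notebook, setting `os.environ["CUDA_LAUNCH_BLOCKING"] = "1"` in the first cell, before anything initializes CUDA, should have the same effect as the terminal command, although I have not verified this on Kaggle.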
Also, thanks for the code. Could you double check this?
while you are using this in your code:
# Define relevant variables for the ML task
num_classes = 15501
I also tried both num_classes = 15501 and num_classes = 15505. The labels are originally numbered from 1 to 15505, but the competition data has 15501 labels because four labels were lost during the data cleaning process. This is in the JSON file.
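With labels numbered from 1 and with gaps, one option is to remap them to contiguous indices starting at 0, so that num_classes equals the number of distinct labels. A minimal sketch, assuming the raw labels can be collected into a list; `build_label_map` and the toy values are hypothetical:

```python
def build_label_map(raw_labels):
    """Map each distinct raw label to a contiguous index starting at 0."""
    return {label: idx for idx, label in enumerate(sorted(set(raw_labels)))}

raw_labels = [1, 2, 4, 7, 7, 2]   # toy labels: 1-based, with gaps
label_map = build_label_map(raw_labels)
targets = [label_map[label] for label in raw_labels]
num_classes = len(label_map)

print(label_map)     # {1: 0, 2: 1, 4: 2, 7: 3}
print(targets)       # [0, 1, 2, 3, 3, 1]
print(num_classes)   # 4
```

Applied to the real data, this would give 15501 classes with targets in [0, 15500], provided the map is built from the cleaned label set.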
I think I found the reason, but I don't know how to solve it. It seems the model is a typical object classifier with 1000 output classes, no more. The error might be caused by a mismatch between the number of classes and the number of outputs of the last layer, but I do not know how to fix it. Do you have any solution?