CUDA error, Kaggle dataset

I want to train a CNN model (ResNet50) in PyTorch with a dataset from Kaggle. I ran my code on a Kaggle GPU and received a CUDA error:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The number of classes is 15505, and I load the data with a customized data loader, not the built-in function. This code previously worked with 10 classes on Colab using a PyTorch DataLoader.

Can anybody help?

Since you've changed the number of classes in your new workflow, I would guess your target might contain invalid indices.
Make sure the target contains class indices in [0, nb_classes - 1], or rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing method in the stacktrace.
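
A quick way to verify this is a range check on the targets before computing the loss. A minimal sketch, assuming your custom loader yields the targets as a LongTensor (targets and num_classes are placeholder names here, not your actual variables):

import torch

num_classes = 15505
targets = torch.tensor([0, 42, 15504])  # hypothetical batch of class indices

# nn.CrossEntropyLoss expects class indices in [0, num_classes - 1];
# anything outside this range triggers the device-side assert on the GPU
assert targets.min() >= 0 and targets.max() < num_classes, \
    f"invalid target range: [{targets.min()}, {targets.max()}]"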

I have run this code with different numbers of classes many times: 2 classes, 10 classes, 870 classes, and I did not receive this error. This time I changed the data loader. Is it possible the problem is caused by this?

In addition, I did not understand the solution you gave me: "rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing method in the stacktrace".

Thanks,

You can check the code on my GitHub:

If you cannot run the script in a terminal and need to run it in a notebook, execute it on the CPU to get a better stacktrace, which would then most likely point to an index error in nn.CrossEntropyLoss.
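
To clarify the earlier suggestion: CUDA_LAUNCH_BLOCKING=1 python script.py args sets the environment variable on the command line before launching the script. In a notebook you can set it from Python instead; a minimal sketch, assuming no CUDA call has been made yet:

import os

# Must be set before the first CUDA operation (ideally in the first cell),
# so kernel launches run synchronously and the stacktrace points at the
# actually failing operation instead of a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternatively, debug on the CPU, which raises a plain Python error
# with an accurate stacktrace:
device = torch.device("cpu")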

Also, thanks for the code. Could you double-check this? You mentioned 15505 classes, while you are using this in your code:

# Define relevant variables for the ML task
num_classes = 15501

I checked with the CPU and I received a weird error: "name 'loss' is not defined".

Also, I tried both num_classes = 15501 and num_classes = 15505, because the labels are originally numbered from 1 to 15505, but the competition data has 15501 labels, since we lost four labels during the data cleaning process. The labels are in a JSON file.

The new issue points towards a script error and I guess you might be running into the exception:

except Exception as e:
    log.error(f"Exception in data processing- skip and continue = {e}")

without skipping this iteration.
Check whether this except block caught a real error earlier (which would explain why loss was never defined) and fix that underlying error first.
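
For illustration, a minimal sketch of the pattern (process_sample, the sample list, and log are placeholders, not your actual code):

import logging

log = logging.getLogger(__name__)

def process_sample(sample):  # placeholder for your decoding logic
    if sample is None:
        raise ValueError("bad sample")
    return sample

for sample in [1, None, 3]:  # hypothetical raw data
    try:
        data = process_sample(sample)
    except Exception as e:
        log.error(f"Exception in data processing - skip and continue = {e}")
        continue  # actually skip this iteration; without it, names assigned
                  # in the try block (such as loss) stay undefined below
    print(data)  # only reached for samples that were processed successfully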

I think I found the reason, but I don't know how to solve it. It seems the model output is the typical object classifier for 1000 different classes, not more. It might be because of a mismatch between the number of classes and the size of the last layer's output, but I do not know how to fix it! Do you have any solution?

Yes, this sounds reasonable, as I didn't realize you were using a standard ResNet without changing the output layer.
Use:

import torch.nn as nn
import torchvision

model = torchvision.models.resnet50()
model.fc = nn.Linear(2048, 15505)  # ResNet50 produces 2048 features before fc
model.to(device)

to replace the last linear layer with a new one returning logits for 15505 classes.
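
One more thing to watch: since your labels originally run from 1 to 15505 with four IDs missing, you may also need to remap them to contiguous zero-based indices before passing them to nn.CrossEntropyLoss. A minimal sketch with made-up IDs:

# Hypothetical raw label IDs from the JSON file (1-based, with gaps)
raw_labels = [1, 3, 15505, 3]

# Map each raw ID to a contiguous zero-based index so the targets
# land in [0, num_classes - 1]
id_to_index = {lbl: idx for idx, lbl in enumerate(sorted(set(raw_labels)))}
targets = [id_to_index[lbl] for lbl in raw_labels]

# actual number of distinct classes; use this size for the new nn.Linear layer
num_classes = len(id_to_index)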

Thanks, it seems to be OK now, but the dataset is enormous and needs a GPU to run.