CUDA error, Kaggle dataset

I want to train a CNN model (ResNet50) in PyTorch with a dataset from Kaggle. I ran my code on a Kaggle GPU and received a CUDA error:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The number of classes is 15505, and I load the data with a customized data loader, not the built-in function. This code previously worked with 10 classes on Colab using a PyTorch DataLoader.

Can anybody help?

Since you've changed the number of classes in your new workflow, I would guess your target might contain invalid indices.
Make sure the target contains class indices in [0, nb_classes - 1], or rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing method in the stacktrace.
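
A quick way to verify this is a range check on the targets before computing the loss. A minimal sketch, assuming your custom loader yields the targets as a LongTensor (targets and num_classes are placeholder names here, not your actual variables):

import torch

num_classes = 15505
targets = torch.tensor([0, 42, 15504])  # hypothetical batch of class indices

# nn.CrossEntropyLoss expects class indices in [0, num_classes - 1];
# anything outside this range triggers the device-side assert on the GPU
assert targets.min() >= 0 and targets.max() < num_classes, \
    f"invalid target range: [{targets.min()}, {targets.max()}]"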

I have run this code with different numbers of classes many times: 2 classes, 10 classes, 870 classes, and I did not receive this error. This time I changed the data loader. Is it possible the problem is caused by this?

In addition, I did not understand the solution you gave me: "rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing method in the stacktrace".

Thanks,

You can check the code on my GitHub:

If you cannot run the script in a terminal and need to run it in a notebook, execute it on the CPU to get a better stacktrace, which would then most likely point to an index error in nn.CrossEntropyLoss.
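
To clarify the earlier suggestion: CUDA_LAUNCH_BLOCKING=1 python script.py args sets the environment variable on the command line before launching the script. In a notebook you can set it from Python instead; a minimal sketch, assuming no CUDA call has been made yet:

import os

# Must be set before the first CUDA operation (ideally in the first cell),
# so kernel launches run synchronously and the stacktrace points at the
# actually failing operation instead of a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternatively, debug on the CPU, which raises a plain Python error
# with an accurate stacktrace:
device = torch.device("cpu")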

Also, thanks for the code. Could you double-check this? You mentioned 15505 classes, while you are using this in your code:

# Define relevant variables for the ML task
num_classes = 15501

I checked with the CPU and I received a weird error: "name 'loss' is not defined".

Also, I tried both num_classes = 15501 and num_classes = 15505, because the labels are originally numbered from 1 to 15505, but the competition data has 15501 labels, since we lost four labels during the data cleaning process. The labels are in a JSON file.

The new issue points towards a script error and I guess you might be running into the exception:

except Exception as e:
    log.error(f"Exception in data processing- skip and continue = {e}")

without skipping this iteration.
Check whether this except block caught a real error earlier (which would explain why loss was never defined) and fix that underlying error first.
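
For illustration, a minimal sketch of the pattern (process_sample, the sample list, and log are placeholders, not your actual code):

import logging

log = logging.getLogger(__name__)

def process_sample(sample):  # placeholder for your decoding logic
    if sample is None:
        raise ValueError("bad sample")
    return sample

for sample in [1, None, 3]:  # hypothetical raw data
    try:
        data = process_sample(sample)
    except Exception as e:
        log.error(f"Exception in data processing - skip and continue = {e}")
        continue  # actually skip this iteration; without it, names assigned
                  # in the try block (such as loss) stay undefined below
    print(data)  # only reached for samples that were processed successfully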

I think I found the reason, but I don't know how to solve it. It seems the model output is the typical object classifier for 1000 different classes, not more. It might be because of a mismatch between the number of classes and the size of the last layer's output, but I do not know how to fix it! Do you have any solution?

Yes, this sounds reasonable, as I didn't realize you were using a standard ResNet without changing the output layer.
Use:

import torch.nn as nn
import torchvision

model = torchvision.models.resnet50()
model.fc = nn.Linear(2048, 15505)  # ResNet50 produces 2048 features before fc
model.to(device)

to replace the last linear layer with a new one returning logits for 15505 classes.
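
One more thing to watch: since your labels originally run from 1 to 15505 with four IDs missing, you may also need to remap them to contiguous zero-based indices before passing them to nn.CrossEntropyLoss. A minimal sketch with made-up IDs:

# Hypothetical raw label IDs from the JSON file (1-based, with gaps)
raw_labels = [1, 3, 15505, 3]

# Map each raw ID to a contiguous zero-based index so the targets
# land in [0, num_classes - 1]
id_to_index = {lbl: idx for idx, lbl in enumerate(sorted(set(raw_labels)))}
targets = [id_to_index[lbl] for lbl in raw_labels]

# actual number of distinct classes; use this size for the new nn.Linear layer
num_classes = len(id_to_index)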

Thanks, it seems to be OK now, but the dataset is enormous and needs a GPU to run.