I get a CUDA device-side assert triggered only when running the model on PyTorch's EMNIST dataset.
It runs without any issues on MNIST, FashionMNIST, GTSRB, and Food101.
I change the number of output neurons to match the number of classes in each dataset.
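For context, this is roughly how I swap the classifier head (a minimal sketch; resnet18 and the fc attribute are just stand-ins for the pretrained backbone I actually load):

import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(num_classes=10)  # placeholder for the CIFAR-pretrained model
num_classes = 47  # e.g. the EMNIST "balanced" split has 47 classes; I set this per dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)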
The error message is as follows:
../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "train_t.py", line 351, in <module>
loss.backward()
File "/home/dir/.local/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/dir/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
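As the message suggests, the assert can be surfaced synchronously by setting the environment variable when launching the script (using the script name from the traceback above):

CUDA_LAUNCH_BLOCKING=1 python train_t.py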
Version details:
Torch : 2.0.1+cu117
Python : 3.8
Cuda : 11.7
Additional details:
The models used are pretrained on either CIFAR10 or CIFAR100, so I use the following transforms when training on MNIST, FashionMNIST, or EMNIST:
import torchvision
transform = torchvision.transforms.Compose([
    torchvision.transforms.Grayscale(num_output_channels=3),
    torchvision.transforms.Resize((32, 32))])
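I attach this to the dataset roughly like so (a sketch; the split name here is only a placeholder, not necessarily the one I actually use):

train_set = torchvision.datasets.EMNIST(
    root="./data", split="balanced", train=True, download=True, transform=transform)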
Without the Grayscale transform (i.e. without changing the number of channels to 3), it throws the error:
RuntimeError: output with shape [1, 32, 32] doesn't match the broadcast shape [3, 32, 32]
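In case it helps narrow down the label/class mismatch implied by the "t >= 0 && t < n_classes" assertion, this is the kind of check I can run to compare the EMNIST label range with the number of output neurons (again a sketch, with a placeholder split name):

import torchvision
ds = torchvision.datasets.EMNIST(root="./data", split="balanced", train=True, download=True)
print("label range:", int(ds.targets.min()), "to", int(ds.targets.max()))
print("len(ds.classes):", len(ds.classes))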
Any suggestions on how to resolve the CUDA error for EMNIST?