Device-side assert triggered error on AWS EC2 instance

Hi,

I'm trying to train my model with PyTorch on an AWS EC2 instance (g3s.xlarge), but I keep hitting a common error, the famous RuntimeError: CUDA error: device-side assert triggered.

To get a clearer stack trace, I ran my script with CUDA_LAUNCH_BLOCKING=1, and here is the full stack trace I get:

  /pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 367, in <module>
    backup_every=args.backup_every)
  File "train.py", line 267, in train
    if torch.isnan(loss):
RuntimeError: CUDA error: device-side assert triggered

This error appears during the first epoch, or during the evaluation after the first epoch (it varies).

I checked my dataset and:

    • There are no negative numbers
    • All the label files are in YOLO format
    • There are exactly the same number of image files and label files

I have 17 classes (so I use 17 to generate my cfg, and I set 17 in my .data file).
I can't identify where the mistake is. Do you have any idea?

Thanks in advance and have a good day!

Kind regards,
Florian

Is your code working fine on the CPU?
If the error is raised in the first epoch, it might not take too long to run it on the CPU.
Based on the stack trace I'm not sure where the failing indexing operation is used.
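
Since the assert complains about an out-of-bounds index, one quick check you could add right before the loss computation is to validate the target tensor. This is just a sketch; targets and num_classes are placeholders for whatever names your training loop actually uses:

    # Hypothetical sanity check, adapt the names to your training loop.
    # Class/index values must stay inside [0, num_classes - 1], otherwise
    # the CUDA indexing kernel raises exactly this device-side assert.
    assert targets.min() >= 0, f"negative class index: {targets.min().item()}"
    assert targets.max() < num_classes, (
        f"class index {targets.max().item()} >= num_classes ({num_classes})"
    )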

This code already worked in the past with another dataset on the GPU (I already had this error message before, but that was my fault, an error in the number of classes).

This time I can’t identify the problem.

When I run the same command on the CPU (I just change the batch_size and the number of epochs), I get the following error:

Using CPU

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    backup_every=args.backup_every)
  File "train.py", line 198, in train
    load_darknet_weights(model, f'weights/{cutoff_name}')
  File "/home/florian/dev/utils/models.py", line 278, in load_darknet_weights
    conv_w = torch.from_numpy(weights[ptr:ptr + num_w]).view_as(conv_layer.weight)
RuntimeError: shape '[512, 256, 3, 3]' is invalid for input of size 776331

If changing the batch size yields this error, most likely you are using a wrong view or reshape operation.
Could you post the model definition and also try to make sure to keep the batch dimension constant, e.g. via x.view(x.size(0), -1) instead of x.view(-1, some_features)?
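
For illustration, a minimal sketch of the difference (the shapes here are made up, not taken from your model):

    import torch

    x = torch.randn(16, 256, 13, 13)   # e.g. a conv feature map, batch_size = 16

    # Keeps the batch dimension and infers the rest, so it stays correct
    # for any batch size.
    flat_ok = x.view(x.size(0), -1)    # shape [16, 256 * 13 * 13]

    # Hard-codes the feature count: it happens to work here, but as soon as
    # the spatial size or channel count changes, the number no longer matches
    # and you either get a shape error or samples leak across the batch dim.
    flat_bad = x.view(-1, 256 * 13 * 13)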

My model is the Darknet / YOLO Tiny model. I didn't change the model definition, but you can find it here: https://paste.ofcode.org/YPZftRpwZFsNydkuCfNZvc

I checked the code, and I think I keep the batch dimension constant all the time.

To update this post:

  • The same code works with epochs = 9, batch_size = 16, without using the YOLO weights (training from scratch), and with less data (around 800 images instead of 10k)

  • But the same code doesn't work with epochs = 9, batch_size = 16, without using the YOLO weights, if I keep the whole dataset (it returns the same CUDA error)

It’s very weird …
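
Since only the full dataset fails, a scan of every label file for a class index outside [0, 16] should point at the bad sample. A quick sketch (the labels/ folder is a placeholder for wherever the YOLO .txt files live, and it assumes the class id is the first value on each line):

    import glob

    num_classes = 17
    for path in glob.glob('labels/**/*.txt', recursive=True):
        with open(path) as f:
            for line_no, line in enumerate(f, 1):
                parts = line.split()
                if not parts:
                    continue
                cls = int(float(parts[0]))   # YOLO format: class id comes first
                if cls < 0 or cls >= num_classes:
                    print(f'{path}:{line_no} -> class id {cls} is out of range')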