RuntimeError: CUDA error: device-side assert triggered - Resnet18

ziad · April 26, 2021, 2:21pm

I’m trying to use PyTorch’s Resnet18 model with my image data. Given the complexity of the model as well as the size of the data, I’d like to run it using CUDA. I’m doing the following:

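import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

resnet_cnn = models.resnet18(pretrained = True)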
num_ftrs = resnet_cnn.fc.in_features
resnet_cnn.fc = nn.Linear(num_ftrs, 8)

criterion = nn.CrossEntropyLoss().cuda()
optimizer_ft = optim.SGD(resnet_cnn.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=5, gamma=0.1)

After this, I attempt to train and test my model with the following loop:

count = 0
loss_list = []
iteration_list = []
accuracy_list = []
epochs = 30

for epoch in range(epochs):
    for i, (images, labels) in enumerate(trainloader):
            resnet_cnn = resnet_cnn.cuda()
            images.cuda()
            labels.cuda()

            optimizer_ft.zero_grad()
            outputs = resnet_cnn(images.cuda())
            loss = criterion(outputs.cuda(), labels.cuda())
            loss.backward()
            optimizer_ft.step()

            count += 1

            if count % 50 == 0:
                correct = 0
                total = 0

                for i, (images, labels) in enumerate(testloader):
                    # images.to(device)
                    # labels.to(device)

                    outputs = resnet_cnn(images.cuda())
                    predicted = torch.max(outputs.data, 1)[1]
                    total += len(labels)
                    correct += (predicted == labels.cuda()).sum()
                accuracy = 100 * correct / float(total)

                loss_list.append(loss.data)
                iteration_list.append(count)
                accuracy_list.append(accuracy)

                if count % 500 == 0:
                    print("Iteration: {} Loss: {} Accuracy: {} %".format(count, loss.data, accuracy))

But I am met with the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-48-cb669e8d47c0> in <module>()
      7 for epoch in range(epochs):
      8     for i, (images, labels) in enumerate(trainloader):
----> 9             resnet_cnn = resnet_cnn.cuda()
     10             images.cuda()
     11             labels.cuda()

3 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in cuda(self, device)
    489             Module: self
    490         """
--> 491         return self._apply(lambda t: t.cuda(device))
    492 
    493     def xpu(self: T, device: Optional[Union[int, device]] = None) -> T:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    385     def _apply(self, fn):
    386         for module in self.children():
--> 387             module._apply(fn)
    388 
    389         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    407                 # `with torch.no_grad():`
    408                 with torch.no_grad():
--> 409                     param_applied = fn(param)
    410                 should_use_set_data = compute_should_use_set_data(param, param_applied)
    411                 if should_use_set_data:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in <lambda>(t)
    489             Module: self
    490         """
--> 491         return self._apply(lambda t: t.cuda(device))
    492 
    493     def xpu(self: T, device: Optional[Union[int, device]] = None) -> T:

RuntimeError: CUDA error: device-side assert triggered

I can’t figure out what I’m doing wrong, as I’ve trained another manually-defined CNN in the same way. Thank you in advance.

ptrblck · April 27, 2021, 5:44am

Could you rerun the code via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the complete stack trace here, please?
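The traceback above comes from a notebook cell (`<ipython-input-48-...>`), so there may be no script to prefix with the variable. A minimal sketch of the notebook equivalent, assuming the kernel has been restarted and this cell runs before any CUDA work (safest: before `import torch`):

```python
import os

# Assumption: fresh kernel, no CUDA call made yet. Setting the variable
# after CUDA has been initialized is not expected to have any effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# From here, import torch and rerun the training cell. Kernel launches
# then run synchronously, so the stack trace should point at the op
# that actually failed rather than at a later, unrelated call.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

This also explains why the original trace points at `resnet_cnn.cuda()`: with asynchronous launches, a device-side assert is typically reported at some later CUDA call, not at the line that caused it.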

ziad · April 27, 2021, 8:54am

Thanks, @ptrblck - after running CUDA_LAUNCH_BLOCKING=1 python script.py args, I saw that I was using the incorrect number of GPU
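For anyone landing here with the same message: a frequent cause of a device-side assert with nn.CrossEntropyLoss is a target index outside the range the final layer can produce (here, `nn.Linear(num_ftrs, 8)` gives valid classes 0 to 7). That is not confirmed as the cause in this thread. The helper below is a hypothetical, pure-Python sketch for checking a batch of labels on the CPU; for a tensor, pass `labels.tolist()`.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes).

    Hypothetical helper, not part of PyTorch. It does not special-case
    the loss's ignore_index value.
    """
    return [(i, int(y)) for i, y in enumerate(labels)
            if not 0 <= int(y) < num_classes]

# Made-up batch for illustration: with 8 outputs, valid indices are 0..7.
bad = find_out_of_range_labels([0, 3, 7, 8, -1], num_classes=8)
print(bad)  # [(3, 8), (4, -1)]
```

Running the same batch through the model on the CPU is another way to get a readable error message instead of the generic CUDA assert.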