RuntimeError: cuda runtime error (59) : device-side assert triggered at /media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THC/generic/THCTensorMath.cu:26

hmk · August 29, 2021, 10:04pm

Hi, I'm building a JetBot with the SparkFun Jetson Nano 2GB kit. The V01-00 image worked, but training the model isn’t working properly. It seems to have issues with CUDA. These are the errors I have been getting:
1. RuntimeError: CUDA error: device-side assert triggered (on one of the times I tried to run the program).
2. RuntimeError: cuda runtime error (59) : device-side assert triggered at /media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THC/generic/THCTensorMath.cu:16
3. RuntimeError: cuda runtime error (59) : device-side assert triggered at /media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THC/generic/THCTensorMath.cu:26

This is my code:
NUM_EPOCHS = 30
BEST_MODEL_PATH = 'best_model.pth'
best_accuracy = 0.0

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for i, data in enumerate(all_dataloader):

for epoch in range(NUM_EPOCHS):

    for images, labels in iter(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()

    test_error_count = 0.0
    for images, labels in iter(test_loader):
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        test_error_count += float(torch.sum(torch.abs(labels - outputs.argmax(1))))

    test_accuracy = 1.0 - float(test_error_count) / float(len(test_dataset))
    print('%d: %f' % (epoch, test_accuracy))
    if test_accuracy > best_accuracy:
        torch.save(model.state_dict(), BEST_MODEL_PATH)
        best_accuracy = test_accuracy

This is the error:

RuntimeError Traceback (most recent call last)
in
13 outputs = model(images)
14 loss = F.cross_entropy(outputs, labels)
---> 15 loss.backward()
16 optimizer.step()
17

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
105 products. Defaults to False.
106 """
--> 107 torch.autograd.backward(self, gradient, retain_graph, create_graph)
108
109 def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
91 Variable._execution_engine.run_backward(
92 tensors, grad_tensors, retain_graph, create_graph,
---> 93 allow_unreachable=True) # allow_unreachable flag

RuntimeError: cuda runtime error (59) : device-side assert triggered at /media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THC/generic/THCTensorMath.cu:26

A picture is attached.

I would appreciate your help so I can proceed with my project. @albanD

albanD · August 31, 2021, 1:34pm

Hi,

You can try to run with CUDA_LAUNCH_BLOCKING=1 to get a more accurate error.
Also, can you try running the code on the CPU? It might be a problem with the embedded GPU.
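
For reference, here is a minimal sketch of how both suggestions could be tried from a notebook. The environment variable has to be set before CUDA is initialized, so the safest place is the very first cell, before `import torch`. The helper `out_of_range_labels` and the name `num_classes` are hypothetical and not from the original code. They are included only because one common cause of a device-side assert inside `F.cross_entropy` is a label outside `[0, num_classes)`, which may or may not be what is happening here.

```python
import os

# Set this before the first CUDA call (simplest: before importing torch).
# Kernels then run synchronously, so the traceback points at the op that
# actually failed instead of a later call such as loss.backward().
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def out_of_range_labels(labels, num_classes):
    """Return the distinct labels that are not valid class indices.

    Hypothetical helper: `labels` is any iterable of integers and
    `num_classes` is assumed to equal the width of the model's output.
    """
    return sorted({int(l) for l in labels if not 0 <= int(l) < num_classes})


# Hypothetical usage with the loader from the question; it only reads the
# labels, so it runs on the CPU and does not touch the GPU at all:
#
#   for images, labels in train_loader:
#       bad = out_of_range_labels(labels.tolist(), num_classes)
#       if bad:
#           print("labels outside [0, num_classes):", bad)
```

To run the training loop itself on the CPU, it should be enough to define `device = torch.device('cpu')` and move the model there with `model.to(device)` before the loop. CPU errors are usually raised at the failing line with a more readable message.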