RuntimeError: CUDA error: unknown error in async mode

I’m trying to train a segmentation model. After I changed the dataset size from 1500 to 4500 images, I started getting this error:

RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The stack trace points to the line with the loss (the loss is just PyTorch MSE):
loss_sum += loss.item()

The training code wasn’t changed at all. I’m using batch_size=32 with 512x512x3 images (I have a 24GB GPU, so it’s not an OOM problem). I have also set pin_memory=True in my DataLoader.
Interesting thing: if I set CUDA_LAUNCH_BLOCKING=1, I don’t get any errors. Training becomes 10-20% slower, but it runs fine.
Can you help me, please? It looks like a problem with the async CUDA mode.
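
For reference, this is roughly how I set it (a minimal sketch; the variable has to be in place before CUDA is initialized, so I set it before importing torch; exporting it in the shell before launching the script works as well):

import os

# must be set before torch initializes CUDA, otherwise it has no effect
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch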

Could you post a minimal and executable code snippet reproducing the issue, please?

It’s extremely hard to catch. With pin_memory=True it can crash after 2 hours of training; setting it to False can delay the crash to 10 hours.
I didn’t face this problem with the smaller dataset. It looks like magic to me.

Simplified code:

import torch
import torchvision

def image_to_tensor(x):
    return torchvision.transforms.ToTensor()(x)

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, images, masks):
        self.images = images
        self.masks = masks

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return image_to_tensor(self.images[idx]), image_to_tensor(self.masks[idx])

class MseLoss(torch.nn.Module):
    def forward(self, input, target):
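        # note: mse_loss already returns the mean by default, so the outer torch.mean is a no-op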
        return torch.mean(torch.nn.functional.mse_loss(input, target))
 
ds = MyDataset(images=..., masks=...)
dl = torch.utils.data.DataLoader(ds, batch_size=32, num_workers=8, shuffle=True, pin_memory=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = Unet().to(device)
loss_fn = MseLoss().to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

for epoch in range(300):
    net.train()
    loss_sum = 0
    for image, mask in dl:
        image, mask = image.to(device), mask.to(device)
        optimizer.zero_grad()
        pred = net(image)
        loss = loss_fn(pred, mask)
        loss.backward()
        optimizer.step()
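        # .item() synchronizes the CPU with the GPU, so asynchronous CUDA errors tend to surface on the next line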
        loss_sum += loss.item()

    loss_avg = loss_sum / len(dl)
    print(f'epoch {epoch + 1} loss is {loss_avg}')

I’ve got a crash even with CUDA_LAUNCH_BLOCKING=1:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unknown error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7adddd792617 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7adddd74d98d in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7adde01cb128 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xe2439d (0x7add5c02439d in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x510cf6 (0x7add9e110cf6 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x55ca7 (0x7adddd777ca7 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7adddd76fcb3 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7adddd76fe49 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7c0dc8 (0x7add9e3c0dc8 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7add9e3c1155 in /home/qew/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #27: <unknown function> + 0x29d90 (0x7ade23429d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7ade23429e40 in /lib/x86_64-linux-gnu/libc.so.6)

Is the GPU still usable afterwards or is your system not able to communicate with it anymore?
Also, did you see any Xids in dmesg?
I doubt it’s a PyTorch error at this point, as it sounds as if your system is dropping the device. Given that it also depends on the workload, it might be a good idea to check the thermals of your system to see if any parts are overheating.
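
If it helps to correlate the crash with a temperature or power spike, you could log both from inside the training loop. A rough sketch using the pynvml bindings (from the nvidia-ml-py package, assuming it is installed):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def log_gpu_stats():
    # temperature in degrees C and power draw in W
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    print(f'GPU temp: {temp} C, power draw: {power:.1f} W')

Calling it once per epoch or every few hundred iterations should be enough to see whether the crash coincides with a spike.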

Yeah, after this my system isn’t able to communicate with the GPU anymore, so I need to reboot it.
The temperature stays below 70°C during the whole training run, so that looks fine.
I have realised that my PyTorch was compiled with CUDA 12.1, but I have CUDA 12.0 installed locally. Could that be the reason for the problem?

No, since your locally installed CUDA toolkit won’t be used: the PyTorch binaries ship with their own CUDA runtime dependencies. Even if you were mixing different CUDA libs, it wouldn’t knock your GPU off the bus, but would raise a software error in case of missing symbols etc.
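
You can double-check which CUDA runtime the binaries actually ship with directly from Python (just a quick sanity check):

import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA runtime the binaries were built with
print(torch.backends.cudnn.version())  # bundled cuDNN version
print(torch.cuda.get_device_name(0))   # detected device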

Got it. I have no idea what’s wrong with it then.

Did you already check dmesg for Xids?

Could you please describe what that is?

A detailed description of Xid Errors can be found here.
TL;DR:

The Xid message is an error report from the NVIDIA driver that is printed to the operating system’s kernel log or event log. Xid messages indicate that a general GPU error occurred, most often due to the driver programming the GPU incorrectly or to corruption of the commands sent to the GPU. The messages can be indicative of a hardware problem, an NVIDIA software problem, or a user application problem.
These messages provide diagnostic information that can be used by both users and NVIDIA to aid in debugging reported problems.

You can thus just run dmesg | grep -i xid and check if the driver or another part of your setup is running into an issue. User application errors, e.g. memory violations, are also logged there and you can check the linked docs for a description of the corresponding error (in case you are seeing Xid errors in your log).
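
Since the crash only shows up after hours of training, you could also dump these lines automatically once the run dies, e.g. with a small helper like this (check_xids is just made up for illustration and assumes dmesg is readable without root on your machine):

import subprocess

def check_xids():
    # collect all kernel log lines mentioning Xid errors
    out = subprocess.run(['dmesg'], capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if 'xid' in line.lower()]

print('\n'.join(check_xids()) or 'no Xid messages found')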

That seems very helpful. I will check it ASAP and give you feedback. Thanks!

Hi! I found this error after yet another crash:
[23924.128134] NVRM: Xid (PCI:0000:03:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.

This error often indicates a BIOS, power supply, or thermal issue, so check e.g. whether your PSU is big enough.

Yeah, that looks like the cause. I have already ordered a more powerful power supply :slight_smile: