Hello,
I have written my own autograd class, which means I have implemented the backward pass myself too. When I run the code, I get seemingly random CUDA errors.
RuntimeError: CUDA error: an illegal memory access was encountered
This is one of the four errors I receive; here are the other three:
nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
File "/projects/ovcare/classification/ywang/myenv/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Based on previous posts, some suggested running with `CUDA_LAUNCH_BLOCKING=1 python your_script.py`.
I use multiple GPUs (DataParallel), but with this flag my code becomes so slow that after one hour it had not finished a single batch of the first epoch. When I try just one GPU, it runs, and after 4 epochs it gives me this error:
num_x = x.view(1, H, W) - x_points.view(batch_size, 1, 1)
RuntimeError: CUDA error: an illegal memory access was encountered
It says this line has a problem. Here is my code:
coords = torch.tensor([[h, w] for h in range(H) for w in range(W)], device=device)
x = coords[:, 0]; y = coords[:, 1]
x, y = x.reshape(H, W), y.reshape(H, W)
num_x = x.view(1, H, W) - x_points.view(batch_size, 1, 1)
num_y = y.view(1, H, W) - y_points.view(batch_size, 1, 1)
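For reference, the same lines run fine on CPU with made-up small sizes (`batch_size`, `H`, `W` and the `*_points` values here are placeholders, not my real data), so the broadcasting itself seems correct:

```python
import torch

# Same lines as above, on CPU, with hypothetical small sizes.
batch_size, H, W = 2, 3, 4
device = torch.device("cpu")

coords = torch.tensor([[h, w] for h in range(H) for w in range(W)], device=device)
x, y = coords[:, 0], coords[:, 1]
x, y = x.reshape(H, W), y.reshape(H, W)

# Stand-ins for the real x_points / y_points tensors.
x_points = torch.arange(batch_size, device=device).float()
y_points = torch.arange(batch_size, device=device).float()

# (1, H, W) - (batch_size, 1, 1) broadcasts to (batch_size, H, W).
num_x = x.view(1, H, W) - x_points.view(batch_size, 1, 1)
num_y = y.view(1, H, W) - y_points.view(batch_size, 1, 1)
print(num_x.shape)  # torch.Size([2, 3, 4])
```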
These lines are inside the `@staticmethod def backward(ctx, grad_output)` of my defined model.
I cannot understand where the problem is. I am defining the `x` and `y` tensors myself, so where is the illegal access?
Another strange thing: if I print `x.view(1, H, W)` and `x_points.view(batch_size, 1, 1)` separately just before this line, they print correctly even in the run that crashes, but when I try to print again after calculating `num_x`, I get this error.
I have read others' posts, and they say they had a problem with indexing and …, but in my case I am calculating the output gradient, and that works fine and the shape of the gradient is correct. I can print the shapes of these tensors without any error, but when I try to print their values, it gives me the error.
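From what I understand, CUDA kernels launch asynchronously, so an illegal access can be reported at a later op such as a `print`. As a debugging sketch (an assumed approach, not my actual training code; the helper name and sizes are made up), synchronizing right after a suspect line should make the error surface there instead:

```python
import torch

def checked(tensor, label):
    # If the tensor is on GPU, wait for all pending kernels; any
    # asynchronous CUDA error is raised here, at the suspect line,
    # rather than at a later print().
    if tensor.is_cuda:
        torch.cuda.synchronize()
    print(label, tuple(tensor.shape))
    return tensor

# Hypothetical usage with a placeholder tensor:
num_x = checked(torch.zeros(2, 3, 4), "num_x:")
```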