CUDA error: an illegal memory access was encountered with reproduction

I have been getting this error.
The error message suggested reporting an issue if the following code reproduces the problem, which it does:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([25, 128, 63, 63], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[2, 2], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
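# forward and backward pass through the cuDNN convolution; because CUDA runs
# asynchronously, the error may only be reported at the synchronize() call below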
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

What does this mean? How can I fix this?

Thanks!!

Which PyTorch and CUDA versions and what hardware are you using?
On my self-compiled recent git checkout (1.11.0a0+git0a07488) with CUDA 11.3 I don’t get any error on an RTX 3090.

Best regards

Thomas

Hi Thomas,

I’m actually glad to hear that; it’s probably an issue with my code then.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

pytorch version: 1.9.1

would you know what this error actually means?

thanks!

Hmm, I just tried and I get the same error on Colab (previously I was on Kaggle), funky…

Try to use the latest PyTorch (1.10).
The error indicates an out-of-bounds memory access, similar to a segfault on the CPU, something like an indexing error in the low-level code.
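As a rough illustration (a deliberately broken toy example, not your code): an out-of-range index coming from Python usually hits a device-side assert rather than a raw illegal access, but the effect is similar, namely an asynchronous CUDA error that may only surface at a later call:

import torch

x = torch.randn(4, device='cuda')
idx = torch.tensor([10], device='cuda')   # out of range for a tensor of size 4
y = x[idx]                                # the gather kernel runs with a bad index
torch.cuda.synchronize()                  # the error is reported here at the latest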
You do reproduce this with the exact code you posted?

Best regards

Thomas

Yes, I think I at least now know what is causing this, though the error seems misleading.

When I train a model (a GAN) with a batch size larger than 4, I get this error. Unless I hard reset (factory reset the runtime), anything related to the GPU keeps producing this error, even after restarting the kernel and clearing all caches.
It’s not just the code above; just calling .to('cuda') is enough.
It’s funky, but I’m also just really learning PyTorch now.
Maybe it’s because I keep the image arrays in memory (about 700 MB total) instead of loading each image in batches?

So these CUDA errors should be fatal for the Python process they happen in, but restarting the kernel should then make things work again.
Perhaps the most typical way to trigger it in a working PyTorch setup is to pass classification targets to CrossEntropyLoss that exceed the number of classes implied by the size of the logit tensor. But something seems funny with your system, then.
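For illustration of the CrossEntropyLoss case, a minimal sketch (hypothetical shapes, assuming 10 classes, so any target >= 10 is out of range):

import torch

criterion = torch.nn.CrossEntropyLoss()
logits = torch.randn(8, 10, device='cuda')            # 8 samples, 10 classes
targets = torch.randint(0, 10, (8,), device='cuda')
targets[0] = 10                                       # invalid: valid classes are 0..9
loss = criterion(logits, targets)                     # fires a device-side assert
torch.cuda.synchronize()                              # the asynchronous error surfaces here at the latest
# after this, the CUDA context of the process is corrupted and needs a kernel restart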
At this point, maybe @ptrblck knows more…

Best regards

Thomas

It’s weird, though at this point it’s two separate systems, Kaggle and Colab.
At least it works now when running with smaller batch sizes; in the next few days I may try to just load from disk.

Either way, thanks for looking into it with me, it’s very much appreciated!

1. Do you see the illegal memory access when running the original GAN training from a clean and working environment? If so, could you post an executable code snippet which would reproduce the issue, and the output of python -m torch.utils.collect_env, please?

2. Does this minimal conv/cuDNN code snippet also yield the illegal memory access in a clean environment, or only after you’ve already hit the previous error in the GAN training?

3. Could you explain how you are performing the “factory reset”?

Yes, that’s correct and if you are running into asserts (such as in the nn.CrossEntropyLoss use case) the CUDA context would be corrupted and restarting the Python kernel should work.
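As a rough sketch of that behaviour (toy example, not your training code): once such an assert has fired, later, unrelated CUDA calls in the same process fail as well until the kernel is restarted:

import torch

try:
    logits = torch.randn(2, 3, device='cuda')
    targets = torch.tensor([5, 0], device='cuda')     # 5 is out of range for 3 classes
    torch.nn.functional.cross_entropy(logits, targets)
    torch.cuda.synchronize()
except RuntimeError as err:
    print('first error:', err)

torch.randn(1, device='cuda')                         # unrelated call, now fails too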

Hi, thanks

To 1.:

Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7 (64-bit runtime)
Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu111
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

To 2. and 3.:
This is confusing to me; I believe restarting the kernel etc. should result in a clean environment. Maybe that is not the case on Kaggle/Colab, or my knowledge is just too poor.
However, after restarting, the error persists on both platforms. On Kaggle there is no fix except to completely shut the instance off and turn it back on (I’m guessing this switches the GPU too).
Restarting the kernel/runtime doesn’t help, and from then on anything that makes a call to the GPU results in this error.

On Colab there is an option under Runtime called Factory Reset Runtime. It clears everything.
This works.

So yes, I do encounter it in a clean environment, but only after I have hit the previous error in an earlier session, even after restarting.

If I start with batch size <= 4 all is fine.

Thanks for the follow-up. I’m not familiar with how Kaggle kernels work, but based on your description it doesn’t seem to be sufficient to restart the kernel.
In any case, once you’ve “completely shut off” the Kaggle runtime, what kind of error are you seeing when running the code the first time?

Thanks for helping!
Sorry, I was unclear.
If I completely shut it off and start it again, I likely get a new GPU assigned and there is no error, like Colab’s factory reset.

Could the issue be that I’m storing the whole array in memory?
That’s the only major thing that is different this time around; I repurposed old code of mine which worked fine but loaded images from disk.
I’ll try that next with this dataset too.

Edit/Update:
I get the same error when reading from disk, weird.
I think I have an issue in my code.
The traceback points at:
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag

Something similar in TensorFlow would mean the graph is disconnected.
What I can’t make out is how this relates to batch_size.

Edit/Update 2:
Actually, I can go up to batch_size 6.

Update 3: I still get the error.
Also, batch_size 6 sometimes triggers it (it works for a few epochs, I think, and eventually fails, or maybe it really is just intermittent).
This is so weird.

To circle back to my initial suspicion:
TensorFlow has a similar error message, and there it would mean that the graph is disconnected.
Do you think this could be the case here?
How does batch size fit into that?

This is where it happens:

<ipython-input-21-5f2cf50439c6> in train(save_model)
     34             gen_loss = get_gen_loss(gen, disc, mask, image, adv_criterion, recon_criterion, 1000)
     35             gen_loss.backward()
---> 36             gen_opt.step()
     37 
     38             # Keep track of the average discriminator loss

/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
     86                 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     87                 with torch.autograd.profiler.record_function(profile_name):
---> 88                     return func(*args, **kwargs)
     89             return wrapper
     90 

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
    116                    lr=group['lr'],
    117                    weight_decay=group['weight_decay'],
--> 118                    eps=group['eps'])
    119         return loss

/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
     85         # Decay the first and second moment running average coefficient
     86         exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
---> 87         exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
     88         if amsgrad:
     89             # Maintains the maximum of all 2nd moment running avg. till no


Thanks a lot, guys. It works with batch_size 4, but not knowing what is causing this drives me nuts.

No, I don’t think so. Based on your description it rather sounds as if you are running out of memory and then getting false errors due to a sticky error. However, I’m still unsure which error message you get the first time after running your code in a clean and working environment.

Oh, I’m sorry, I seem to have misunderstood what you meant (asked about) earlier.
The first time I run it with a batch_size > 4, I get this error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It stays the same for all GPU-related calls, even when restarting the kernel, unless I do a factory reset in Colab.

OK, thanks. Could you post an executable code snippet to reproduce the issue, please?
In case you are using a custom dataset, please post the input shapes which would be needed to execute the training and run into the issue.

I am getting the same error within a Kaggle kernel. I was able to train the model successfully; however, during inference I receive this error. I have noticed that substantially lowering the number of epochs does not throw the error, but I cannot proceed with this approach because the model has far too high a bias with so few epochs.

I used torch.cuda.empty_cache(), thinking that the error possibly originates from excessive memory usage. This is the output of !nvidia-smi after that:

Tue Mar 29 11:33:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    37W / 250W |    997MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Is there any solution to this error? I am participating in a competition but have not been able to submit any predictions in days due to this error. Many thanks.

An illegal memory access error won’t be raised if you are running out of memory.
One recommended approach is to update PyTorch to the latest release with the latest library stack and check whether this was a known and already fixed issue. Currently that would be the nightly binaries with the CUDA 11.5 runtime.
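After updating, a quick sanity check of which versions are actually picked up at runtime (standard torch attributes):

import torch

print(torch.__version__)                # PyTorch release, e.g. 1.9.0+cu111 above
print(torch.version.cuda)               # CUDA runtime PyTorch was built with
print(torch.backends.cudnn.version())   # cuDNN version in use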
