RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Khawar_Islam · March 19, 2021, 1:35am

I am training a model and have put the dataset inside the data folder. The structure looks like this:

--data
-----mars
---------bbox_train
---------bbox_test
---------info

Many developers have said that this is a label problem, but I am not sure, because the labels are in the right place.

Traceback (most recent call last):
Args:Namespace(arch='resnet50graphpoolparthyper', concat=False, dataset='mars', dropout=0.1, eval_step=100, evaluate=False, gamma=0.1, gpu_devices='0', height=256, htri_only=False, lr=0.0003, margin=0.3, max_epoch=800, nheads=8, nhid=512, num_instances=4, part1=4, part2=8, part3=2, pool='avg', pretrained_model='/home/jiyang/Workspace/Works/video-person-reid/3dconv-person-reid/pretrained_models/resnet-50-kinetics.pth', print_freq=80, save_dir='log_hypergraphsagepart', seed=1, seq_len=8, start_epoch=0, stepsize=200, test_batch=1, train_batch=32, use_cpu=False, warmup=True, weight_decay=0.0005, width=128, workers=4, xent_only=False)
==========
Currently using GPU 0
Initializing dataset mars
=> MARS loaded
Dataset statistics:
  ------------------------------
  subset   | # ids | # tracklets
  ------------------------------
  train    |   625 |     8298
  query    |   626 |     1980
  gallery  |   622 |     9330
  ------------------------------
  total    |  1251 |    19608
  number of images per tracklet: 2 ~ 920, average 59.5
  ------------------------------
Initializing model: resnet50graphpoolparthyper
Model size: 44.17957M
==> Epoch 1/800  lr:1.785e-05
Traceback (most recent call last):
  File "main_video_person_reid_hypergraphsage_part.py", line 357, in <module>
    main()
  File "main_video_person_reid_hypergraphsage_part.py", line 220, in main
    train(model, criterion_xent, criterion_htri, optimizer, trainloader, use_gpu)
  File "main_video_person_reid_hypergraphsage_part.py", line 257, in train
    outputs, features = model(imgs)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/khawar/HDD_Khawar1/hypergraph_reid/models/ResNet_hypergraphsage_part.py", line 621, in forward
    x = self.base(x)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/khawar/HDD_Khawar1/hypergraph_reid/models/resnet.py", line 213, in forward
    x = self.conv1(x)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

ptrblck · March 19, 2021, 4:30am

Could you check this post and see if you might be hitting the same issue (a Turing GPU using the 1.8.0 pip wheels with the CUDA 10.2 runtime)?

Khawar_Islam · March 19, 2021, 8:11am

@ptrblck Thank you. I just installed the version below and the problem was solved.

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

I would like to ask one more question: I have now tried different batch sizes, but the CUDA out of memory error remains the same.

RuntimeError: CUDA out of memory. Tried to allocate 290.00 MiB (GPU 0; 7.79 GiB total capacity; 5.51 GiB already allocated; 292.81 MiB free; 5.61 GiB reserved in total by PyTorch)

I am using a GeForce RTX 2070. Any hints, please?
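
For readers trying to interpret the out-of-memory message above, here is a minimal sketch that pulls the quoted sizes out of the text and does the arithmetic. The helper name `parse_oom_message` is made up for illustration, and the sketch assumes that the "free" figure is the device-wide free memory rather than PyTorch's own cache. Under that assumption, roughly 102 MiB is reserved by PyTorch but not allocated, and roughly 1.9 GiB is neither free nor held by PyTorch's allocator (for example CUDA context overhead, the display, or other processes).

```python
import re

# Conversion factors to MiB for the units that appear in the message.
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

# The message quoted in the post above.
OOM_MSG = (
    "CUDA out of memory. Tried to allocate 290.00 MiB (GPU 0; 7.79 GiB total capacity; "
    "5.51 GiB already allocated; 292.81 MiB free; 5.61 GiB reserved in total by PyTorch)"
)


def parse_oom_message(msg):
    """Return the sizes quoted in an OOM message, all converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, msg)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_message(OOM_MSG)

# Memory PyTorch has reserved from the driver but is not currently using.
cached_unused = sizes["reserved"] - sizes["allocated"]

# Memory that is neither free nor reserved by PyTorch (assumption: "free" is device-wide).
outside_pytorch = sizes["total"] - sizes["reserved"] - sizes["free"]

print(f"requested:            {sizes['requested']:.2f} MiB")
print(f"cached but unused:    {cached_unused:.2f} MiB")
print(f"held outside PyTorch: {outside_pytorch:.2f} MiB")
```

If the last figure is large, it is worth checking what else is occupying the GPU before shrinking the model.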

tlim · March 19, 2021, 8:43am

If your model and data are not simply too big, then the cause is that your GPU has not freed its memory.

Go to terminal → nvidia-smi → kill -9 PID

Select the PIDs of the processes that are taking up a lot of memory (usually python).
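
The same check can be scripted. The sketch below is only an illustration: the function names are made up, the PIDs in `sample` are invented, and it assumes `nvidia-smi` accepts `--query-compute-apps=pid,used_memory` with `--format=csv,noheader,nounits`, which may behave differently on older drivers. It only lists candidates; deciding which process to kill is left to you.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB) tuples, largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        # Skip malformed rows and rows where memory is reported as [N/A].
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        rows.append((int(parts[0]), int(parts[1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def list_gpu_processes():
    """Ask nvidia-smi for compute processes; returns [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=False,
    )
    return parse_compute_apps(result.stdout)


# Invented sample output, so the parser can be tried without a GPU.
sample = "4310, 120\n4242, 5642\n9999, [N/A]\n"
for pid, used_mib in parse_compute_apps(sample):
    print(f"PID {pid} holds {used_mib} MiB")
```

On a machine with a GPU, `list_gpu_processes()` returns the real list in the same format.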

Khawar_Islam · March 19, 2021, 10:21am

@tlim I have already killed all processes but still receive the same error.

tlim · March 19, 2021, 3:31pm

Just to confirm, run watch nvidia-smi. Ensure your GPU memory is nearly empty before you run the script, and see what happens when you run it.

It could be that the model and data are in fact overloading the GPU. It happened to me when I tried running some 3D DL algorithms. Try batch_size=1 and see how that goes.
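
One way to act on this suggestion systematically is to halve the batch size until a single training step fits. This is a rough sketch, not code from the project in question: `largest_batch_that_fits` and `fake_step` are hypothetical names, and `fake_step` merely simulates the error so the logic can be run without a GPU. In real use, `run_step` would build a batch of the given size and run one forward/backward pass, and calling `torch.cuda.empty_cache()` between attempts may help.

```python
def is_oom(error):
    """True if the exception looks like a CUDA out-of-memory error."""
    return isinstance(error, RuntimeError) and "out of memory" in str(error)


def largest_batch_that_fits(run_step, start_batch_size):
    """Halve the batch size until run_step(batch_size) stops raising OOM.

    Returns 0 if even a batch size of 1 does not fit, which suggests that the
    model or a single sample is too large for the GPU.
    """
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as error:
            if not is_oom(error):
                raise  # a different problem; do not hide it
            batch_size //= 2
    return 0


def fake_step(batch_size):
    """Stand-in for a real training step: pretends more than 6 samples overflow."""
    if batch_size > 6:
        raise RuntimeError("CUDA out of memory. (simulated)")


# Starting from 32, the attempts are 32, 16, 8, then 4.
print(largest_batch_that_fits(fake_step, 32))
```

A result of 0 points to the "model and data are too big" case rather than to leftover processes.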

Svengof · March 28, 2021, 1:13pm

Hello Guys!

I was facing the same issue and installing torch with CUDA11.1 solved it. But unfortunately it gives me a new error related to CUDA:

RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

When I run nvidia-smi on the command line, I get:

Sun Mar 28 15:02:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:65:00.0 Off |                  Off |
| 33%   33C    P8     4W / 260W |      1MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:66:00.0 Off |                  Off |
| 33%   36C    P8     8W / 260W |    411MiB / 48598MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1323      G   /usr/lib/xorg/Xorg                           392MiB |
|    1      1934      G   cinnamon                                      16MiB |
+-----------------------------------------------------------------------------+

I suppose the problem comes from the fact that I’m trying to use torch with CUDA 11.1 with an incompatible Driver Version, but I don’t know how to fix the problem. I’m working on a remote server and it is the first time I’m using GPUs. I believe that the legendary @ptrblck would be able to help me?

ptrblck · March 28, 2021, 7:39pm

The 1.8.1 release fixed the Turing issue, so you could simply update to it and the wheels and binaries should work.
In case you want to use CUDA11.1, you would have to update the drivers indeed, as 440.100 is too old.
Table 1 gives you an overview of the required versions.
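
To make the driver requirement concrete, here is a small sketch that compares an installed driver version against a minimum for a given CUDA release. The minimum versions in `MIN_LINUX_DRIVER` are quoted from memory of NVIDIA's CUDA release notes and are assumptions here; check them against Table 1 before relying on them. The function names are made up for illustration.

```python
# Assumed minimum Linux driver versions per CUDA toolkit release.
# Quoted from memory of NVIDIA's release notes; verify against Table 1.
MIN_LINUX_DRIVER = {
    "10.1": "418.39",
    "10.2": "440.33",
    "11.1": "455.23",
}


def version_tuple(version):
    """Turn '455.23.05' into (455, 23, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver_version, cuda_version):
    """True if the driver is at least the assumed minimum for that CUDA release."""
    minimum = MIN_LINUX_DRIVER[cuda_version]
    return version_tuple(driver_version) >= version_tuple(minimum)


# Driver 440.100 is the one shown in the nvidia-smi output earlier in the thread.
for cuda in ("10.2", "11.1"):
    verdict = "ok" if driver_supports("440.100", cuda) else "too old"
    print(f"driver 440.100 with CUDA {cuda}: {verdict}")
```

Under these assumed minimums, 440.100 is sufficient for the CUDA 10.2 wheels but not for the CUDA 11.1 wheels.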

FengZhiheng-coder · April 30, 2021, 8:02am

@ptrblck
I have tried the solution

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

But I still get the same error: RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The following is the output of python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.20210103-g3387789

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Quadro RTX 6000
GPU 1: Quadro RTX 6000
GPU 2: Quadro RTX 6000
GPU 3: Quadro RTX 6000

Nvidia driver version: 455.23.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-metric-learning==0.9.96
[pip3] torch==1.8.0+cu111
[pip3] torch-points-kernels==0.6.10
[pip3] torchaudio==0.8.0
[pip3] torchvision==0.9.0+cu111
[conda] Could not collect

ptrblck · April 30, 2021, 8:06am

Could you check whether you are running out of memory, and reduce the batch size if so?
If this doesn’t help, could you post an executable code snippet to reproduce the issue?

FengZhiheng-coder · April 30, 2021, 8:36am

Thanks a lot.
The batch size is 1.
The executable code is part of a project that needs some data and special settings, so I don’t know how to show the key code at the moment.
The error information is as follows:

Traceback (most recent call last):
  File "train.py", line 329, in <module>
    main()
  File "train.py", line 198, in main
    pos2_trans)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/pwclonet_cu90/models.py", line 224, in forward
    l0_points_f1)
  File "/data/pwclonet_cu90/pointconv_util.py", line 382, in forward
    new_points = conv(new_points)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/pwclonet_cu90/pointconv_util.py", line 73, in forward
    outputs = self.conv(x.to(torch.float32))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

FengZhiheng-coder · May 6, 2021, 2:19am

@ptrblck
I have found the reason for the error. I moved my model to the GPU with model.cuda(), and then I created some new tensors inside the model without moving them to the GPU. But why does the traceback report RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED?
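
For anyone hitting the same cause, here is a minimal sketch of two common ways to keep tensors created inside a model on the same device as its weights. The module `PointBlock` is hypothetical and is not the code from the project above: fixed tensors are registered as buffers so that model.cuda() moves them too, and tensors created during forward take their device and dtype from the input. The sketch falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn


class PointBlock(nn.Module):
    """Hypothetical module that needs extra tensors besides its parameters."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=1)
        # A buffer is moved together with the module by .cuda() / .to(device).
        self.register_buffer("offset", torch.ones(1, channels, 1))

    def forward(self, x):
        # Tensors created on the fly inherit device and dtype from the input.
        mask = torch.zeros(x.shape[0], 1, x.shape[2], device=x.device, dtype=x.dtype)
        return self.conv(x + self.offset + mask)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PointBlock(4).to(device)
x = torch.randn(2, 4, 8, device=device)
out = model(x)
print(out.shape, out.device)
```

Creating `mask` with a bare `torch.zeros(...)` instead would leave it on the CPU, which is the kind of mismatch described in the post above.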

ptrblck · May 6, 2021, 6:02am

It shouldn’t raise this error and a device assertion should be raised instead indicating a device mismatch. Which PyTorch version are you using?

FengZhiheng-coder · May 6, 2021, 10:23am

I use the Docker image ufoym/deepo:latest.

PyTorch version: 1.6.0.dev20200609+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.18.20200610-gc1b6ada

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 460.73.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.6.0.dev20200609+cu101
[pip3] torchvision==0.7.0.dev20200609+cu101
[conda] Could not collect

ptrblck · May 6, 2021, 6:19pm

Could you update to the latest PyTorch version? I’m not sure where your 1.6.0 nightly binary would land, but if it’s between the 1.5.0 and 1.5.1 releases, note that 1.5.0 accidentally removed device assert statements, so users could run into issues such as illegal memory accesses instead of indexing errors. This was fixed again in 1.5.1.

Ajinkya_Ambatwar · May 13, 2021, 3:52pm

Hi,
I am using Pascal GPUs (1080 Ti) with CUDA 10.2 and PyTorch 1.8.1.
I still face this issue!
I tried the solution you mentioned, but I get the same error.

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
terminate called after throwing an instance of 'c10::Error'                                                               
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6b5f6ca2f2 in /home/ajinkya/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f6b5f6c767b in /home/ajinkya/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f6b5f9221f9 in /home/ajinkya/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f6b5f6b23a4 in /home/ajinkya/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e9aea (0x7f6bd34dbaea in /home/ajinkya/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e9b91 (0x7f6bd34dbb91 in /home/ajinkya/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #31: __libc_start_main + 0xe7 (0x7f6bfe8acbf7 in /lib/x86_64-linux-gnu/libc.so.6)

It looks like it has something to do with c10 as well?

ptrblck · May 13, 2021, 7:10pm

Thanks for the update. Could you post an executable code snippet so that we could reproduce the issue?

Ajinkya_Ambatwar · May 14, 2021, 2:23pm

Hello Sir,
I realized that this was due to my higher batch size. Reducing batch size from 32 to 16 removed this issue.
Thank you!

EDIT

I thought reducing batch size solved the issue, but I was wrong.
I reduced the batch size to 4 and ran my code. The code worked fine for the initial 4-5 batches and then ran into the “notorious”
RuntimeError: CUDA error: an illegal memory access was encountered

I have a custom function built using CUDA and the PyTorch C++ bindings.
The custom function works well for the first few batches and then fails.

I don’t know if I will be able to share a workable snippet. But I have the library files and a test notebook which I can share in case you need to reproduce the error.

Thank you!

Ajinkya_Ambatwar · May 17, 2021, 11:33am

Hi @ptrblck I figured out my issue. It was something to do with the indexing in my cuda code. After debugging, the error was gone!

Thank you!!

UPDATE

I realized this is not related to my code at all. This error happens randomly, in random epochs. It looks like it has something to do with the CUDA caching allocator.
I am working on 3D point clouds, where I sample each point cloud ([1024, 3]) down to a smaller number of points ([512, 3]). If I reduce the output number of points to a very low number (say [8, 3]), the code works totally fine, but my accuracy is significantly compromised.

I am working on a shared server. Does this issue have anything to do with the available GPU memory?
@ptrblck
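
Because CUDA kernels run asynchronously, an illegal memory access is often reported at a later, unrelated call, which can make the failure look random. A generic way to narrow it down, sketched below, is to force synchronous launches with the CUDA_LAUNCH_BLOCKING environment variable and to validate indices on the host before they reach a custom kernel. The helper `check_sample_indices` is hypothetical; the actual kernel from the posts above is not shown in the thread, so this is not a diagnosis of that code.

```python
import os

# Must be set before the first CUDA call (simplest: before importing torch),
# so that every kernel launch is synchronous and errors surface at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_sample_indices(indices, num_points):
    """Raise IndexError if any index falls outside [0, num_points)."""
    bad = [index for index in indices if not 0 <= index < num_points]
    if bad:
        raise IndexError(
            f"{len(bad)} index/indices outside [0, {num_points}); first is {bad[0]}"
        )
    return True


# Indices into a point cloud of 1024 points: all valid.
check_sample_indices([0, 511, 1023], num_points=1024)

# An index equal to the number of points is one past the end and is rejected.
try:
    check_sample_indices([0, 1024], num_points=1024)
except IndexError as error:
    print("caught:", error)
```

With a tensor of indices, `idx.tolist()` can be passed to the helper; this copies to the host, so it is meant for debugging runs only.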

execptionerror · August 9, 2021, 5:29pm

Already Solved RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED