Runtime error: no valid conv algo available in CuDNN

Hi,
I have been trying to solve this problem for several days, and none of the solutions posted here or elsewhere online have worked so far. I am running PyTorch 1.13.1 with CUDA 11.7 and cuDNN 8.5.0.0, and torch.backends.cudnn.is_available() returns True. A simple standalone torch Conv2d layer on the GPU runs fine. I also tried reducing the batch size, since some posts suggest this error can really be a VRAM out-of-memory issue, but that did not resolve it either. Does anyone know how to solve this? My device is an NVIDIA L40S.
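
For reference, the standalone check I ran looked roughly like this (the shapes and channel counts here are just placeholders, not the ones from my training script):

import torch

# Bare Conv2d on the GPU as a sanity check; this runs without any cuDNN error.
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(2, 3, 256, 256, device="cuda")
with torch.no_grad():
    y = conv(x)
print(y.shape, torch.backends.cudnn.is_available(), torch.backends.cudnn.version())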

I! CuDNN (v8500) function cudnnDestroyTensorDescriptor() called:
i! Time: 2024-06-21T12:07:05.256012 (0d+0h+0m+28s since start)
i! Process=1199677; Thread=1199677; GPU=NULL; Handle=NULL; StreamId=NULL.

Traceback (most recent call last):
  File "/home/hgong/MagicPaint/train_tiktok.py", line 1466, in <module>
    main(args)
  File "/home/hgong/MagicPaint/train_tiktok.py", line 1216, in main
    x = infer_model.get_first_stage_encoding(infer_model.encode_first_stage(image))
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/hgong/MagicPaint/model_lib/ControlNet/ldm/models/diffusion/ddpm.py", line 2112, in encode_first_stage
    return self.first_stage_model.encode(x)
  File "/home/hgong/MagicPaint/model_lib/ControlNet/ldm/models/autoencoder.py", line 83, in encode
    h = self.encoder(x)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hgong/MagicPaint/model_lib/ControlNet/ldm/modules/diffusionmodules/model.py", line 523, in forward
    hs = [self.conv_in(x)]
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: no valid convolution algorithms available in CuDNN
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1199673) of binary: /home/hgong/anaconda3/envs/magicpose/bin/python
Traceback (most recent call last):
  File "/home/hgong/anaconda3/envs/magicpose/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

To add on: I set the following environment variables to get the cuDNN log messages shown above:
export CUDA_LAUNCH_BLOCKING=1
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=stdout
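
For reference, the same flags can also be set from inside the script, as long as it happens before torch is imported (just a sketch; I used the shell exports myself):

import os

# Equivalent to the shell exports above; must run before `import torch`
# so that CUDA and cuDNN pick the variables up.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "stdout"

import torch  # imported only after the environment is set, on purpose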

Could you update PyTorch to the latest stable or nightly version and check if you are still seeing the issue, please?

Sorry for the late reply. I haven't updated to the new version yet. But without CUDA_LAUNCH_BLOCKING enabled, I get this error instead: RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I was also monitoring GPU usage, because some posts describe this error message as an out-of-memory error in disguise. All 8 GPUs would at some point spike to 100% volatile GPU utilization one by one, and shortly afterwards the program would terminate. Memory usage spiked to around 25-41 GB throughout the run. Could this be the issue? And if so, would it be solved by the newer packages?
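
I was watching this with nvidia-smi; for anyone who wants to log the same numbers from inside the script, something like this works (a rough sketch, not my actual monitoring code):

import torch

# Print free/total/allocated/reserved memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)
    print(
        f"cuda:{i} free={free_b / 1e9:.1f}G total={total_b / 1e9:.1f}G "
        f"allocated={torch.cuda.memory_allocated(i) / 1e9:.1f}G "
        f"reserved={torch.cuda.memory_reserved(i) / 1e9:.1f}G"
    )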

Following up,
I solved the problem by basically updating all torch-related packages to newer versions. The script finally runs with a newer CUDA, cuDNN, and PyTorch; I believe it was the CUDA 11.8 build of PyTorch 2.3.1 with cuDNN 8.7.0.2. I also had to reinstall xformers and triton to get everything working.
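
A quick way to confirm the upgrade actually took effect in the environment (just a sanity check; the exact version strings will depend on your install):

import torch

# Print the versions the environment actually loads after the upgrade.
print(torch.__version__)                # expected something like 2.3.1+cu118
print(torch.version.cuda)               # expected 11.8
print(torch.backends.cudnn.version())   # cuDNN version as an integer
print(torch.backends.cudnn.is_available())
print(torch.cuda.get_device_name(0))    # should report the L40S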