Hi,
I have been trying to solve this problem for several days, and no solution posted here or elsewhere online has fixed it so far. I am running PyTorch 1.13.1, CUDA 11.7, and cuDNN 8.5.0.0, and torch.backends.cudnn.is_available() returns True. I also ran a simple torch Conv2d layer on the GPU to confirm the operation itself works, and it did. I tried reducing the batch size as well, since some posts suggest running out of VRAM can cause this, but that did not resolve the problem either. Does anyone know how to fix this? The device is an NVIDIA L40S.
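For reference, the standalone sanity check I mentioned looked roughly like this (a minimal sketch; the shapes are just placeholders, but a plain Conv2d forward pass on the GPU completes without error):

```python
import torch
import torch.nn as nn

print(torch.__version__)                    # 1.13.1
print(torch.version.cuda)                   # 11.7
print(torch.backends.cudnn.version())       # 8500
print(torch.backends.cudnn.is_available())  # True

# Minimal cuDNN convolution check on the GPU
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 3, 256, 256, device="cuda")
with torch.no_grad():
    y = conv(x)
print(y.shape)  # torch.Size([1, 64, 256, 256]) -- runs fine
```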
I! CuDNN (v8500) function cudnnDestroyTensorDescriptor() called:
i! Time: 2024-06-21T12:07:05.256012 (0d+0h+0m+28s since start)
i! Process=1199677; Thread=1199677; GPU=NULL; Handle=NULL; StreamId=NULL.
Traceback (most recent call last):
  File "/home/hgong/MagicPaint/train_tiktok.py", line 1466, in <module>
    main(args)
  File "/home/hgong/MagicPaint/train_tiktok.py", line 1216, in main
    x = infer_model.get_first_stage_encoding(infer_model.encode_first_stage(image))
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/hgong/MagicPaint/model_lib/ControlNet/ldm/models/diffusion/ddpm.py", line 2112, in encode_first_stage
    return self.first_stage_model.encode(x)
  File "/home/hgong/MagicPaint/model_lib/ControlNet/ldm/models/autoencoder.py", line 83, in encode
    h = self.encoder(x)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hgong/MagicPaint/model_lib/ControlNet/ldm/modules/diffusionmodules/model.py", line 523, in forward
    hs = [self.conv_in(x)]
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: no valid convolution algorithms available in CuDNN
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1199673) of binary: /home/hgong/anaconda3/envs/magicpose/bin/python
Traceback (most recent call last):
  File "/home/hgong/anaconda3/envs/magicpose/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hgong/anaconda3/envs/magicpose/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: