I have a program for model training and inference that relies on many custom functions and must run in a pytorch1.5.0+cu9.2 environment. I set up an Ubuntu 18.04 environment in the cloud with an L40S GPU, and the installed NVIDIA driver reports CUDA version 12.4. When I run the program, I get the following error message:
return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
So far, I have tried several approaches, including:
- Downloading a lower version of the CUDA runtime toolkit (e.g., toolkit 10.2), but it still reports the above error. (P.S. I haven't tried CUDA toolkit 9.2 because the official installer only supports up to Ubuntu 16.04.)
- I read online that the issue might be caused by a mismatch between the CUDA driver version and the toolkit libraries (cuDNN), which are dynamically linked. I tried adjusting the library linking order, but still hit the same error.
- I also read that the L40S only supports CUDA 11.x or later. Could this be the reason I cannot run pytorch1.5.0+cu9.2 on the L40S? I did manage to run the program successfully on pytorch1.8.0+cu111, but the code is tightly coupled to low-level, hardware-specific constructs, so I cannot upgrade it without permission. I would like to find an alternative that still lets me use the L40S to train models under pytorch1.5.0+cu9.2.
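On the last point: the L40S is an Ada Lovelace GPU (compute capability sm_89), and a CUDA toolkit can only compile for GPU architectures that existed when it shipped. Below is a minimal sketch of that check; the table of maximum targetable architectures per toolkit is approximate and assumed for illustration, not an exhaustive official list:

```python
# Approximate maximum compute capability each CUDA toolkit's compiler
# can target (assumed values for illustration):
#   CUDA  9.2 -> up to sm_72 (Volta)
#   CUDA 10.2 -> up to sm_75 (Turing)
#   CUDA 11.0 -> sm_80; 11.1 -> sm_86; 11.8 -> sm_89/sm_90 (Ada/Hopper)
MAX_SM = {
    "9.2":  (7, 2),
    "10.2": (7, 5),
    "11.0": (8, 0),
    "11.1": (8, 6),
    "11.8": (9, 0),
}

def toolkit_supports(gpu_cc, toolkit):
    """True if a build compiled with `toolkit` can natively target a GPU
    of compute capability `gpu_cc` (newer GPUs need a newer toolkit)."""
    return gpu_cc <= MAX_SM[toolkit]

L40S_CC = (8, 9)  # Ada Lovelace, sm_89
print(toolkit_supports(L40S_CC, "9.2"))   # -> False: cu9.2 cannot target sm_89
print(toolkit_supports(L40S_CC, "11.8"))  # -> True
```

If this picture is right, the symptom would make sense: the driver can sometimes JIT-compile a wheel's bundled PTX for a newer GPU, but cuDNN ships only precompiled kernels, so a cu9.2 build would fail exactly at a cuDNN call like torch.batch_norm with CUDNN_STATUS_EXECUTION_FAILED.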