NVIDIA L40S-48Q and "RuntimeError: CUDA error: operation not supported"

We have been allocated a brand new NVIDIA L40S-48Q in our research environment.

Unfortunately, when we try to use the GPU we get “RuntimeError: CUDA error: operation not supported”. We have tried switching the system CUDA toolkit between 12.1 and 11.8 with an appropriately matched PyTorch installation, and we have tried PyTorch 2.3, 2.5, and the 2.6 nightly. CUDA_HOME, PATH, and LD_LIBRARY_PATH are set to the values others have suggested, and we have tried Python versions ranging from 3.9 to 3.11.

The research environment has other GPUs available with 24GB of memory, and they work with the same settings, so it seems to be a problem with this particular GPU.
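For reference, this is roughly how we checked that each PyTorch build was matched to the intended CUDA version - just the version attributes PyTorch itself exposes, shown here as a sketch:

# sanity check: which CUDA version the installed PyTorch wheel was built against
import torch

print("PyTorch:", torch.__version__)                  # e.g. 2.5.1+cu121
print("built against CUDA:", torch.version.cuda)      # e.g. 12.1
print("cuDNN:", torch.backends.cudnn.version())
print("GPU arches in this build:", torch.cuda.get_arch_list())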

Any suggestions?

Could you post a minimal and executable code snippet to reproduce the issue, please?

Thanks for replying @ptrblck

First of all, here is the environment - we have CUDA 2.1 installed (though nvidia-smi reports 12.2 - I think because the driver was originally installed under 12.2).

(env39) xxx:~$ nvidia-smi
Sat Nov 9 05:42:03 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06              Driver Version: 535.183.06    CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S-48Q                On  | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P8             N/A /  N/A  |     51MiB / 49152MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|========================================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                            51MiB |
+---------------------------------------------------------------------------------------+

(env39) xxx:~$ ls -lh /etc/alternatives/cuda
lrwxrwxrwx 1 root root 20 Nov 8 16:07 /etc/alternatives/cuda -> /usr/local/cuda-12.1

(env39) xxx:~$ echo $CUDA_HOME
/usr/local/cuda

(env39) xxx:~$ echo $PATH
/usr/local/cuda:/usr/local/cuda/lib64:/home/unimelb.edu.au/xxx/anaconda3/envs/env39/bin:/home/unimelb.edu.au/xxx/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/lib/fastx/3/apps

(env39) xxx:~$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:

(env39) xxx:~$ python
Python 3.9.20 (main, Oct 3 2024, 07:27:41)

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.zeros(1).cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
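So the failure really is the very first CUDA allocation. As a single, self-contained script (same behaviour as the REPL session above, nothing added beyond it):

# minimal repro: the very first CUDA allocation on this GPU fails
import torch

assert torch.cuda.is_available()        # True on this machine
x = torch.zeros(1, device="cuda")       # raises RuntimeError: CUDA error: operation not supported
print(x)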

CUDA 2.1 was released in January 2009 and supports only Tesla compute capabilities, so I doubt you have it installed.
In any case, the locally installed CUDA toolkit won’t matter unless you build PyTorch from source.

Since the very first call already fails, I would start by either running any other CUDA-enabled application to verify it can run or by reinstalling the NVIDIA driver.

Sorry, I mistyped - we have CUDA 12.1, not 2.1. Can you suggest another CUDA-enabled app we can try?

I will have to ask the administrators of the system to re-install the NVIDIA driver, but it would be helpful to test it first with some other app that ought to run…

Try to compile a few CUDA samples and execute them.

Thanks, @ptrblck - this is not something I am familiar with doing; I just use GPUs for NLP. Can you point me to the best resource for getting the compilation tools and samples? Or can you suggest a Linux application I can download to check whether it runs on CUDA?

NVIDIA/cuda-samples might be a good start to check if your setup is able to compile and execute a few example applications.
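If setting up a compiler for the samples turns out to be a hurdle, a rougher alternative (just a sketch on my side, assuming libcuda.so.1 is on the loader path) is to ask the driver API directly whether a context can be created, with PyTorch and the CUDA toolkit taken out of the picture entirely:

# minimal CUDA driver API smoke test via ctypes: can a context be created at all?
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")   # driver library shipped with the NVIDIA driver

def check(result, call):
    # turn a non-zero CUresult into a readable exception
    if result != 0:
        msg = ctypes.c_char_p()
        cuda.cuGetErrorString(result, ctypes.byref(msg))
        text = msg.value.decode() if msg.value else "unknown error"
        raise RuntimeError(f"{call} failed: {text} (code {result})")

check(cuda.cuInit(0), "cuInit")

device = ctypes.c_int()
check(cuda.cuDeviceGet(ctypes.byref(device), 0), "cuDeviceGet")

name = ctypes.create_string_buffer(100)
check(cuda.cuDeviceGetName(name, 100, device), "cuDeviceGetName")
print("device 0:", name.value.decode())

# context creation is the step an unlicensed vGPU driver would be expected to refuse
context = ctypes.c_void_p()
check(cuda.cuCtxCreate_v2(ctypes.byref(context), 0, device), "cuCtxCreate")
print("CUDA context created successfully")
check(cuda.cuCtxDestroy_v2(context), "cuCtxDestroy")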

Thanks, @ptrblck

See the message from our network admin below. After he rebooted the server, the problem went away. He thinks a CUDA context could not be created because the NVIDIA driver was unlicensed. Does this make sense to you?

I installed a GPU test software container and it was failing with an error that it was unable to create a CUDA context.

I noticed that the output of ‘nvidia-smi -q’ stated that the driver was unlicensed, so I rebooted and the error went away. Now the GPU diagnostic tool works without error.

Can you please try your software again and see if the license was the problem for you also?

The GPU diagnostic tool I used is here:

https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
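In case it helps anyone else who lands here: the licensing state the admin mentioned shows up in the plain nvidia-smi -q report. A quick way to pull out just those lines - a sketch, since the exact field names depend on the vGPU driver version:

# print only the licensing-related lines from `nvidia-smi -q` (field names vary by vGPU driver)
import subprocess

report = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True, check=True)
for line in report.stdout.splitlines():
    if "licen" in line.lower():         # e.g. a "License Status" field on a vGPU guest
        print(line.strip())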

Good to hear it’s working now!
I’m not familiar with vGPUs and how the driver expects to be licensed, but the debugging steps sound valid and I’m glad your network admin was able to narrow down the issue by trying to create a simple CUDA workflow.
