I’m trying to run a test script on the GPU of a remote machine. The code is
import torch
foo = torch.tensor([1,2,3])
foo = foo.to('cuda')
I’m getting the following error
Traceback (most recent call last):
  File "/remote/blade/test.py", line 3, in <module>
    foo = foo.to('cuda')
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Based on this discussion, a mismatch between the CUDA and PyTorch versions may be the cause of the error. I ran the following
import sys
import torch
print('python v. : ', sys.version)
print('pytorch v. :', torch.__version__)
print('cuda v. :', torch.version.cuda)
to get the versions:
python v. : 3.9.7 (default, Sep 16 2021, 13:09:58)
pytorch v. : 1.11.0.dev20211206
cuda v. : 10.2
Does anything here look off?
The out-of-memory error might be misleading; it seems your setup is unable to use the GPU at all.
Were you able to run PyTorch workloads on the GPU on this system before? If so, what did you change?
Also, which GPU, NVIDIA driver, OS etc. are you using?
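The GPU model, driver version, and current memory use asked about above can be gathered in one go. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH of the compute node; the helper name `query_gpus` is made up for illustration, and it returns an empty list when the tool is missing or fails.

```python
import platform
import shutil
import subprocess


def query_gpus():
    """Return one dict per GPU reported by nvidia-smi, or [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # tool not installed or not on PATH
    fields = ["name", "driver_version", "memory.used", "memory.total"]
    cmd = [exe, "--query-gpu=" + ",".join(fields), "--format=csv,noheader"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return []
    if out.returncode != 0:
        return []  # e.g. the driver is not loaded
    gpus = []
    for line in out.stdout.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(fields):
            gpus.append(dict(zip(fields, values)))
    return gpus


if __name__ == "__main__":
    print("OS:", platform.platform())
    for gpu in query_gpus():
        print(gpu)
```

If `memory.used` is already close to `memory.total`, another process is probably holding the card and the out-of-memory error may be genuine; if the memory is free, a setup problem becomes the more likely explanation.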
This is a remote research computing machine, and I was able to run on its GPUs before. I open an interactive session using
qrsh -q gpu -l gpu_card=1 -pe smp 1
GPU: Tesla V100 PCIe 32GB
NVIDIA driver: version 460.80
OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
It’s hard to tell what might be the root cause, as it was working before and I don’t know what has changed.
I don’t believe there is a PyTorch-related fix, as the error points towards a setup issue.
You could try other environments (e.g. create a new conda env, reinstall PyTorch, and check whether it works), or rebuild and run the CUDA samples to check whether the GPU is usable outside PyTorch.
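To compare a fresh environment against the broken one, the same small check can be run in both. This is only a sketch: the function name `cuda_smoke_test` and the keys of the returned dict are invented here, and it reports a failure instead of raising so the two results are easy to compare side by side.

```python
def cuda_smoke_test():
    """Try a tiny round trip through the GPU and report what happened."""
    report = {"torch": None, "cuda_build": None,
              "available": False, "ok": False, "error": None}
    try:
        import torch
    except (ImportError, OSError) as exc:
        # PyTorch is missing or its shared libraries failed to load
        report["error"] = f"import failed: {exc}"
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["available"] = torch.cuda.is_available()
    if not report["available"]:
        return report
    try:
        # Same operation as the failing script, plus a small computation
        x = torch.tensor([1, 2, 3]).to("cuda")
        report["ok"] = (x + 1).cpu().tolist() == [2, 3, 4]
    except RuntimeError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(cuda_smoke_test())
```

If the check passes in a new conda env but fails in the old one, the problem most likely lies in the old environment rather than in the driver or the hardware.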
For anyone with a similar issue: in my case it was caused by conflicting Python libraries. I removed the Python directory outside of conda, and that fixed it.
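A conflict of this kind can often be spotted by checking which import paths lie outside the active environment. The sketch below is an illustration, not the exact steps used in this thread: the helper `paths_outside_env` is a made-up name, and it assumes that everything under `sys.prefix` or `sys.base_prefix` belongs to the environment.

```python
import importlib.util
import os
import site
import sys


def paths_outside_env(paths, prefixes):
    """Return the entries of `paths` that do not live under any of `prefixes`."""
    roots = [os.path.realpath(p) for p in prefixes]
    outside = []
    for entry in paths:
        if not entry:  # '' stands for the current directory; skip it
            continue
        real = os.path.realpath(entry)
        inside = any(
            real == r or real.startswith(r.rstrip(os.sep) + os.sep)
            for r in roots
        )
        if not inside:
            outside.append(entry)
    return outside


if __name__ == "__main__":
    print("interpreter :", sys.executable)
    print("PYTHONPATH  :", os.environ.get("PYTHONPATH"))
    print("user site   :", site.getusersitepackages(),
          "(enabled)" if site.ENABLE_USER_SITE else "(disabled)")
    spec = importlib.util.find_spec("torch")
    print("torch from  :", spec.origin if spec else "not installed")
    for p in paths_outside_env(sys.path, [sys.prefix, sys.base_prefix]):
        print("outside env :", p)
```

Entries reported as outside the environment (for example from `PYTHONPATH` or a user site directory) are candidates for shadowing the packages installed by conda, and the `torch from` line shows which copy of PyTorch would actually be imported.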