Getting "RuntimeError: CUDA error: out of memory" when memory is free

blade · December 7, 2021, 5:41pm

I’m trying to run a small test script on the GPU of a remote machine. The code is:

import torch

foo = torch.tensor([1,2,3])
foo = foo.to('cuda')

I’m getting the following error:

Traceback (most recent call last):
  File "/remote/blade/test.py", line 3, in <module>
    foo = foo.to('cuda')
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

From this discussion, a conflict between the CUDA and PyTorch versions may be the cause of the error. I ran the following

import sys
import torch

print('python v. : ', sys.version)
print('pytorch v. :', torch.__version__)
print('cuda v. :', torch.version.cuda)

to get the versions:

python v. : 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
pytorch v. : 1.11.0.dev20211206
cuda v. : 10.2

Does anything here look off?

ptrblck · December 8, 2021, 7:11am

The out-of-memory error might be misleading; it seems your setup is unable to use the GPU at all.
Were you able to run PyTorch workloads on the GPU on this system before? If so, what did you change?
Also, which GPU, NVIDIA driver, OS etc. are you using?
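
One way to gather most of the details asked for above in a single step is a small report script. This is only a sketch: `collect_gpu_env_report` is a hypothetical helper name, not part of PyTorch, and it assumes that the calls `torch.cuda.is_available()`, `torch.cuda.device_count()` and `torch.cuda.get_device_name()` are enough to describe the setup. The NVIDIA driver version is not covered here and would still have to come from `nvidia-smi`.

```python
import os
import platform
import sys


def collect_gpu_env_report():
    """Gather basic facts about the Python/PyTorch/CUDA setup as a dict.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    report = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "os": platform.platform(),
        # On a shared cluster the scheduler may set this variable.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except Exception as exc:  # a broken install can raise more than ImportError
        report["torch"] = f"not importable: {exc}"
        return report

    report["torch"] = torch.__version__
    report["torch_file"] = torch.__file__  # shows which install was picked up
    report["torch_cuda_build"] = torch.version.cuda
    try:
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            report["device_count"] = torch.cuda.device_count()
            report["device_0"] = torch.cuda.get_device_name(0)
    except RuntimeError as exc:
        # A misconfigured setup can fail here even though a GPU is present.
        report["cuda_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_env_report().items():
        print(f"{key:22s}: {value}")
```

The `torch_file` entry is worth a look in particular, since it shows whether the imported PyTorch actually lives inside the environment you expect.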

blade · December 8, 2021, 10:20am

This is a remote research computing machine, and I was able to run on GPUs before. I open an interactive session using

qrsh -q gpu -l gpu_card=1 -pe smp 1

I’m using:

GPU: Tesla V100 PCIe 32GB
NVIDIA driver: version 460.80
OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)

ptrblck · December 8, 2021, 10:24am

It’s hard to tell what might be the root cause, as it was working before and I don’t know what has changed.
I don’t believe there is a PyTorch-related fix, as the error points towards a setup issue.
You could try other environments (e.g. create a new conda env, reinstall PyTorch, and check if it’s working), or rebuild and run the CUDA samples.
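
When comparing environments, it can help to run the same tiny check in each one. The following is a rough sketch of such a check, not an official test: `cuda_smoke_test` is a hypothetical name, and the check only confirms that a three-element tensor can be created on the GPU and used in one operation.

```python
def cuda_smoke_test():
    """Try one tiny GPU operation and return (ok, message).

    Hypothetical helper for illustration; not a PyTorch API.
    """
    try:
        import torch
    except Exception as exc:
        return False, f"torch could not be imported: {exc}"

    try:
        if not torch.cuda.is_available():
            return False, "torch.cuda.is_available() returned False"
        # Same data as the failing script, created directly on the GPU.
        foo = torch.tensor([1, 2, 3], device="cuda")
        total = int((foo * 2).sum().item())  # .item() forces synchronization
    except RuntimeError as exc:
        return False, f"CUDA operation failed: {exc}"

    if total != 12:
        return False, f"unexpected result {total}, expected 12"
    return True, f"ok on {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    ok, message = cuda_smoke_test()
    print("PASS" if ok else "FAIL", "-", message)
```

If this passes in a fresh conda env but fails in the original one, that points at the environment rather than at the driver or the hardware.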

blade · December 11, 2021, 4:22pm

For folks having a similar issue: in my case it was caused by conflicting Python libraries. I removed the Python directory outside of conda, and that fixed the issue.
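
For anyone who wants to look for this kind of conflict before deleting anything, here is a minimal sketch that lists `sys.path` entries lying outside the active environment. `paths_outside_env` is a hypothetical helper, and treating `sys.prefix` and `sys.base_prefix` as the environment roots is an assumption that may not match every setup.

```python
import os
import site
import sys


def paths_outside_env(paths=None, prefixes=None):
    """Return the entries of `paths` that are not located under any prefix.

    Hypothetical helper for illustration. Defaults to checking sys.path
    against sys.prefix and sys.base_prefix.
    """
    paths = sys.path if paths is None else paths
    prefixes = (sys.prefix, sys.base_prefix) if prefixes is None else prefixes
    roots = [os.path.realpath(p) for p in prefixes]
    outside = []
    for entry in paths:
        if not entry:  # an empty string means the current working directory
            continue
        real = os.path.realpath(entry)
        if not any(real == r or real.startswith(r + os.sep) for r in roots):
            outside.append(entry)
    return outside


if __name__ == "__main__":
    print("environment prefix:", sys.prefix)
    print("PYTHONPATH        :", os.environ.get("PYTHONPATH"))
    print("user site enabled :", site.ENABLE_USER_SITE)
    print("user site dir     :", site.getusersitepackages())
    # The directory of the script itself normally shows up here too.
    for entry in paths_outside_env():
        print("outside env       :", entry)
```

Entries reported as outside the environment are not necessarily a problem, but a second copy of a package there can shadow the one installed in conda.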