Cannot use GPU with torch

ct2034 · February 16, 2022, 8:18pm

Whenever I run any torch code on the GPU, I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: out of memory

For example, when running …

CUDA_LAUNCH_BLOCKING=1 /usr/bin/python3 -c "import torch; x = torch.linspace(0, 1, 10, device=torch.device(\"cuda:0\"))"

Even if I select a GPU that definitely has memory left …

nvidia-smi -i 3
Wed Feb 16 21:13:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   3  GeForce RTX 3090    On   | 00000000:61:00.0 Off |                  N/A |
| 30%   26C    P8    18W / 350W |     15MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    3   N/A  N/A      5084      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A     11272      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A   2461850      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

When running
CUDA_VISIBLE_DEVICES=3 CUDA_LAUNCH_BLOCKING=1 /usr/bin/python3 -c "import torch; x = torch.linspace(0, 1, 10, device=torch.device(\"cuda:0\"))"

I get:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/chenkel/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory

My torch version is 1.10.0+cu113.
The CUDA and driver versions are shown in the nvidia-smi output above.

eqy · February 16, 2022, 9:14pm

What output do you get with

CUDA_VISIBLE_DEVICES=3 CUDA_LAUNCH_BLOCKING=1 /usr/bin/python3 -c "import torch; x = torch.linspace(0, 1, 10, device='cuda')"

?

ct2034 · February 16, 2022, 9:33pm

It’s just the same…

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/chenkel/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory

AlphaBetaGamma96 · February 17, 2022, 12:04am

Could you try running pytorch with a different CUDA version?

From your nvidia-smi output, it seems that your driver currently supports CUDA 11.2, yet your PyTorch install is built for CUDA 11.3, so perhaps this could cause an issue? If you're working on Linux, you can check the version of the installed CUDA toolkit via nvcc --version.

You could always create a new environment, install PyTorch with CUDA 11.2 (or whatever version nvcc --version states), and see if that resolves the issue?
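
To make the comparison above concrete, here is a minimal sketch that reads the CUDA version out of a PyTorch version string such as the 1.10.0+cu113 reported earlier in the thread. The helper name cuda_version_from_torch is hypothetical, and the rule that the last digit of the cuXYZ tag is the minor version is an assumption based on commonly seen tags (cu102, cu113), not a documented guarantee.

```python
import re


def cuda_version_from_torch(torch_version):
    """Return the CUDA version encoded in a '+cuXYZ' tag, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return f"{digits[:-1]}.{digits[-1]}"


# Version string taken from the original post.
print(cuda_version_from_torch("1.10.0+cu113"))
```

When torch itself can be imported, torch.version.cuda reports the same information directly, and the result can be compared by eye with the "CUDA Version" field that nvidia-smi prints.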

ct2034 · February 17, 2022, 10:54am

Okay, I know what the issue was: I had set ulimit -v 10000000 in my .bashrc.

I don’t know how this is related, but after removing it, everything works fine. I can reliably reproduce that this was the cause.
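
For anyone checking whether they are hitting the same thing: a likely explanation (an assumption, not confirmed in this thread) is that CUDA initialization reserves a large range of virtual address space, which a ulimit -v cap can block even when the GPU has free memory. In bash, ulimit -v is given in KiB and corresponds to the RLIMIT_AS resource limit, so a rough sketch for inspecting it from Python on Linux could look like this. The helper name describe_address_space_limit is hypothetical.

```python
import resource


def describe_address_space_limit(soft_limit_bytes):
    """Format an RLIMIT_AS value (in bytes) for display.

    Hypothetical helper for illustration only.
    """
    if soft_limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{soft_limit_bytes / 1024**3:.2f} GiB"


# bash's `ulimit -v` takes KiB, so the 10000000 from the post is this many bytes.
print("limit from the post:", describe_address_space_limit(10_000_000 * 1024))

# The limit that applies to the current Python process.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("current soft limit:", describe_address_space_limit(soft))
```

If the current soft limit is not "unlimited", running ulimit -v in the same shell and checking .bashrc or similar startup files for a ulimit line is a reasonable next step.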