Recently I ran into a weird problem when using PyTorch. I have 8 GPU cards in the machine. After running a PyTorch training program for some time, I stopped it with Ctrl+C and then checked the cards using nvidia-smi. Everything looked good.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.46                 Driver Version: 390.46                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1F:00.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:20:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:21:00.0 Off |                    0 |
| N/A   33C    P0    23W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I also wrote a small program, check.cu, to check the GPU memory.
#include <iostream>
#include "cuda.h"
#include "cuda_runtime_api.h"

using namespace std;

int main( void ) {
    int num_gpus;
    size_t free, total;

    // Enumerate all GPUs and print the free/total device memory for each one
    cudaGetDeviceCount( &num_gpus );
    for ( int gpu_id = 0; gpu_id < num_gpus; gpu_id++ ) {
        cudaSetDevice( gpu_id );
        int id;
        cudaGetDevice( &id );
        cudaMemGetInfo( &free, &total );
        cout << "GPU " << id << " memory: free=" << free << ", total=" << total << endl;
    }
    return 0;
}
The output also looked good.
GPU 0 memory: free=16488464384, total=16945512448
GPU 1 memory: free=16488464384, total=16945512448
GPU 2 memory: free=16488464384, total=16945512448
GPU 3 memory: free=16488464384, total=16945512448
GPU 4 memory: free=16488464384, total=16945512448
GPU 5 memory: free=16488464384, total=16945512448
GPU 6 memory: free=16488464384, total=16945512448
GPU 7 memory: free=16488464384, total=16945512448
Then I moved on and tried to create a one-element CUDA tensor with the following check.py, and an OOM error occurred.
import torch
import numpy as np

if __name__ == '__main__':
    x = np.random.randn(1)
    try:
        t = torch.cuda.FloatTensor(x)
        print('Success!')
    except Exception as e:
        print(e)
Only GPU 2 appeared to be out of memory:
$ CUDA_VISIBLE_DEVICES=0 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=1 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=2 python3 check.py
CUDA error: out of memory
$ CUDA_VISIBLE_DEVICES=3 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=4 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=5 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=6 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=7 python3 check.py
Success!
I followed @smth’s suggestion in this reply and killed all the python processes.
But the above problem still occurred. Reinstalling PyTorch did not fix it either. Restarting the machine did clear the error once, but rebooting every time is not feasible since I am running this on a server…
Could anyone please shed some light on this problem? For reference, I am using PyTorch 0.4.1 with CUDA 9.0, and the program I was running is semantic-segmentation-pytorch.