Recently I ran into a weird problem when using PyTorch. I have 8 GPU cards in the machine. After running a PyTorch training program for some time, I stopped it with Ctrl+C and then checked the cards using nvidia-smi. Everything looked good.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.46                 Driver Version: 390.46                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1F:00.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:20:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:21:00.0 Off |                    0 |
| N/A   33C    P0    23W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I also wrote a small program, check.cu, to check the GPU memory.
#include <iostream>
#include "cuda.h"
#include "cuda_runtime_api.h"

using namespace std;

int main( void ) {
    int num_gpus;
    size_t free, total;

    // Enumerate all GPUs and print the free/total device memory for each one
    cudaGetDeviceCount( &num_gpus );
    for ( int gpu_id = 0; gpu_id < num_gpus; gpu_id++ ) {
        cudaSetDevice( gpu_id );
        int id;
        cudaGetDevice( &id );
        cudaMemGetInfo( &free, &total );
        cout << "GPU " << id << " memory: free=" << free << ", total=" << total << endl;
    }
    return 0;
}
The output also looked good.
GPU 0 memory: free=16488464384, total=16945512448
GPU 1 memory: free=16488464384, total=16945512448
GPU 2 memory: free=16488464384, total=16945512448
GPU 3 memory: free=16488464384, total=16945512448
GPU 4 memory: free=16488464384, total=16945512448
GPU 5 memory: free=16488464384, total=16945512448
GPU 6 memory: free=16488464384, total=16945512448
GPU 7 memory: free=16488464384, total=16945512448
Then I moved on and tried to create a one-element CUDA tensor with the following check.py, and an OOM error occurred.
import torch
import numpy as np

if __name__ == '__main__':
    x = np.random.randn(1)
    try:
        t = torch.cuda.FloatTensor(x)
        print('Success!')
    except Exception as e:
        print(e)
Only GPU 2 appeared to be out of memory:
$ CUDA_VISIBLE_DEVICES=0 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=1 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=2 python3 check.py
CUDA error: out of memory
$ CUDA_VISIBLE_DEVICES=3 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=4 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=5 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=6 python3 check.py
Success!
$ CUDA_VISIBLE_DEVICES=7 python3 check.py
Success!
I followed @smth’s suggestion in this reply and killed all the python processes.
But the above problem still occurred. Reinstalling PyTorch did not fix it either. Restarting the machine did clear the error once, but rebooting every time is not feasible since I am running this on a server…
Could anyone please shed some light on this problem? For reference, I am using PyTorch 0.4.1 with CUDA 9.0, and the program I was running is semantic-segmentation-pytorch.