CUDA out of memory with abundant memory

Hi

I’m running on a machine with 8 V100s, but when I try to run 2 training processes in parallel, the following error shows up:
RuntimeError: CUDA error: out of memory

The full log looks like this:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
[>--------------------------------------------------] epoch - 3/102, train loss - 0.57964987 | epoch - 2 s, total - 2 s ETA - 12 d 1 h 36 m 16 s |   Traceback (most recent call last):
  File "/workspace/nrnocs_dev/models/trainer.py", line 41, in train
    self.model.optimizer.zero_grad()
  File "/root/miniconda/envs/py36/lib/python3.6/site-packages/torch/optim/optimizer.py", line 165, in zero_grad
    p.grad.zero_()
RuntimeError: CUDA error: out of memory

But I’m training them on 2 cards, both of which are not in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   45C    P0    90W / 300W |  32287MiB / 32480MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   63C    P0   278W / 300W |   4083MiB / 32480MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   39C    P0    47W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   40C    P0    60W / 300W |    454MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   40C    P0    61W / 300W |    452MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   41C    P0    59W / 300W |    452MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   41C    P0    59W / 300W |    452MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0    45W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Also, the system RAM seems fine too.

              total        used        free      shared  buff/cache   available
Mem:           503G         58G        2.7G         16G        443G        426G
Swap:            0B          0B          0B

Does anyone know why?
Thanks!

Could you explain this statement a bit? How are you training them on two cards, and why do you say they are not in use?
Are you using the env var CUDA_VISIBLE_DEVICES or are you selecting the device directly in the script?
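
For reference, the two approaches look roughly like this (a minimal sketch; the module and input below are placeholders, not from your script):

# Option 1: restrict the process to one physical GPU before launch,
# so PyTorch only sees that card as cuda:0 inside the script:
#   CUDA_VISIBLE_DEVICES=2 python train.py
#   CUDA_VISIBLE_DEVICES=7 python train.py

# Option 2: select the device explicitly inside the script.
import torch
import torch.nn as nn

device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")
model = nn.Linear(128, 10).to(device)        # placeholder module, not your real model
batch = torch.randn(32, 128, device=device)  # placeholder input batch
out = model(batch)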

Hi, sorry for the late reply.
I thought they were not in use based on what nvidia-smi displayed.
It turned out that when I disabled the pin_memory flag in the DataLoader, it worked again.
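
For anyone hitting the same thing, the only change was the pin_memory flag on the DataLoader. A minimal sketch (the dataset below is a placeholder, not my real data):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset just to show the flag.
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=False,  # was True; the pinned (page-locked) host allocation is what failed
)

for inputs, targets in loader:
    pass  # training step goes here

With pin_memory=True the DataLoader copies batches into page-locked host memory, which matches the THCCachingHostAllocator error in the traceback rather than a shortage of GPU memory.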