Hi
I’m running on a machine with 8 V100s, but when I try to run two training processes in parallel, the following error shows up:
RuntimeError: CUDA error: out of memory
The full log looks like this:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
[>--------------------------------------------------] epoch - 3/102, train loss - 0.57964987 | epoch - 2 s, total - 2 s ETA - 12 d 1 h 36 m 16 s | Traceback (most recent call last):
File "/workspace/nrnocs_dev/models/trainer.py", line 41, in train
self.model.optimizer.zero_grad()
File "/root/miniconda/envs/py36/lib/python3.6/site-packages/torch/optim/optimizer.py", line 165, in zero_grad
p.grad.zero_()
RuntimeError: CUDA error: out of memory
But I’m launching the two runs on two separate cards, neither of which is in use.
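For context, this is roughly how one would pin each run to its own card (an illustrative sketch; the device indices are placeholders and this is not my exact launch script):

# Launched once per run, one shell per process, e.g.:
#   CUDA_VISIBLE_DEVICES=2 python train.py
#   CUDA_VISIBLE_DEVICES=3 python train.py
import os
import torch

# With CUDA_VISIBLE_DEVICES set, "cuda:0" inside the process maps to the chosen physical card.
print("visible devices:", os.environ.get("CUDA_VISIBLE_DEVICES"))
device = torch.device("cuda:0")
torch.cuda.set_device(device)

Here is the nvidia-smi output at the time of the error: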
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129 Driver Version: 410.129 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 90W / 300W | 32287MiB / 32480MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:07:00.0 Off | 0 |
| N/A 63C P0 278W / 300W | 4083MiB / 32480MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:0A:00.0 Off | 0 |
| N/A 39C P0 47W / 300W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:0B:00.0 Off | 0 |
| N/A 40C P0 60W / 300W | 454MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:85:00.0 Off | 0 |
| N/A 40C P0 61W / 300W | 452MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:86:00.0 Off | 0 |
| N/A 41C P0 59W / 300W | 452MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 41C P0 59W / 300W | 452MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 38C P0 45W / 300W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Also, the system RAM seems fine:
total used free shared buff/cache available
Mem: 503G 58G 2.7G 16G 443G 426G
Swap: 0B 0B 0B
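In case it helps narrow things down, this is the kind of minimal check I could run on one of the idle cards, covering both a device allocation and a pinned host allocation, since the failure above comes from THCCachingHostAllocator (an illustrative snippet, not my training code):

import torch

# Pick one of the cards nvidia-smi reports as idle, e.g. GPU 2.
device = torch.device("cuda:2")
torch.cuda.set_device(device)

x = torch.randn(1024, 1024, device=device)   # small device-memory allocation
y = torch.empty(1024, 1024).pin_memory()     # pinned host-memory allocation
print(x.sum().item(), y.is_pinned())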
Does anyone know why this happens?
Thanks!