@iffiX @mrshenli I just got time to test the gloo backend. It seems that training runs without significant problems. However, I do have a concern: nvidia-smi reports 7 process entries on each node (four of them on GPU 0), even though I requested only 4 GPUs per node.
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53447      C   /opt/conda/bin/python                       1511MiB |
|    0     53448      C   /opt/conda/bin/python                        803MiB |
|    0     53449      C   /opt/conda/bin/python                        803MiB |
|    0     53450      C   /opt/conda/bin/python                        803MiB |
|    1     53448      C   /opt/conda/bin/python                       1511MiB |
|    2     53449      C   /opt/conda/bin/python                       1511MiB |
|    3     53450      C   /opt/conda/bin/python                       1511MiB |
+-----------------------------------------------------------------------------+
```
The GPU memory usage is uneven as well: GPU 0 reports 3933MiB while GPUs 1-3 report 1522MiB each.
```
$ nvidia-smi
Tue Jul 21 19:49:09 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   38C    P0    49W / 163W |   3933MiB / 32480MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    46W / 163W |   1522MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   38C    P0    46W / 163W |   1522MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    48W / 163W |   1522MiB / 32480MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   36C    P0    42W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    43W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    43W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    41W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```
Can you guys explain what’s happening here?
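For what it's worth, my guess (and it is only a guess) is that the three extra entries on GPU 0 come from each worker creating a CUDA context on the default device before being pinned to its own GPU, which would also explain the higher memory usage on GPU 0. Below is a minimal sketch of the explicit `torch.cuda.set_device` pattern I plan to try; it is not my actual training script, and the `torch.multiprocessing.spawn`-based launcher, single-node rendezvous address, and toy model are assumptions purely for illustration:

```python
# Minimal sketch (not my actual training script) of pinning each worker to its
# own GPU before any CUDA call, so ranks 1-3 never touch GPU 0.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(local_rank, world_size):
    # Hypothetical single-node rendezvous; in my real run these come from the launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=local_rank, world_size=world_size)

    # Pin this process to its own GPU *before* any CUDA work happens,
    # so no CUDA context is created on GPU 0 by the other ranks.
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Toy model purely for illustration.
    model = nn.Linear(10, 10).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])

    x = torch.randn(8, 10, device=device)
    ddp_model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # one process per requested GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

The key point of the sketch is that `torch.cuda.set_device(local_rank)` runs before any tensor is moved to the GPU; if a context is created on the default device first, I would expect exactly the kind of extra ~800MiB entries on GPU 0 shown above. Please correct me if that reading is wrong.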
Regarding the nccl backend problem, I currently don't have time to troubleshoot at a lower level, but I believe it is a bug, either in the nccl library or in the PyTorch implementation.
Thank you.
Best,
Lei