`torch.distributed.barrier` used in multi-node distributed data-parallel training

@iffiX @mrshenli I finally got time to test the gloo backend. The training runs without any significant problems. However, I do have a concern: nvidia-smi reports 7 process entries on each node even though I requested only 4 GPUs per node. There are only 4 unique PIDs, but all four processes show up on GPU 0 in addition to their own GPUs.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53447      C   /opt/conda/bin/python                       1511MiB |
|    0     53448      C   /opt/conda/bin/python                        803MiB |
|    0     53449      C   /opt/conda/bin/python                        803MiB |
|    0     53450      C   /opt/conda/bin/python                        803MiB |
|    1     53448      C   /opt/conda/bin/python                       1511MiB |
|    2     53449      C   /opt/conda/bin/python                       1511MiB |
|    3     53450      C   /opt/conda/bin/python                       1511MiB |
+-----------------------------------------------------------------------------+

The GPU memory usage is also uneven across the four GPUs in use.

$ nvidia-smi
Tue Jul 21 19:49:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   38C    P0    49W / 163W |   3933MiB / 32480MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    46W / 163W |   1522MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   38C    P0    46W / 163W |   1522MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    48W / 163W |   1522MiB / 32480MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   36C    P0    42W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    43W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    43W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    41W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Can you guys explain what’s happening here?
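In case it helps, here is a minimal sketch of the per-rank setup I would expect to avoid extra CUDA contexts on GPU 0. The LOCAL_RANK environment variable and the helper name are assumptions on my side (not taken from my actual script), and I assume the launcher already exports the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE variables for env:// initialization.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # Assumption: the launcher exports LOCAL_RANK for each process on the node.
    local_rank = int(os.environ["LOCAL_RANK"])

    # Pin this process to its own GPU *before* any CUDA call, so no implicit
    # context gets created on cuda:0 by the other ranks on the same node.
    torch.cuda.set_device(local_rank)

    # Assumption: MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Move the model to this rank's GPU and restrict the replica to it.
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)

My understanding is that if a rank touches the default device before pinning itself to its own GPU, it creates an extra CUDA context on GPU 0, which would look consistent with the ~800 MiB entries for PIDs 53448-53450 on GPU 0 above, but I may be wrong about the exact cause here.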

Regarding the nccl backend problem, I currently don't have time to troubleshoot at a lower level, but I believe it is a bug, either in the NCCL library or in the PyTorch implementation.

Thank you.

Best,

Lei