Gloo launches multiple processes on GPU 0

Hi there,

I am noticing some peculiar behavior when training with Gloo as my backend. I am running on an 8-GPU node, and all of the processes, ranks 0 through 7, create a footprint on GPU 0, although only rank 0 should. I set the device in each process with torch.cuda.set_device(rank). This is what the output from nvidia-smi looks like:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   30C    P2    53W / 230W |  16115MiB / 16125MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     On   | 00000000:1C:00.0 Off |                  Off |
| 33%   35C    P2    61W / 230W |   7200MiB / 16125MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:1D:00.0 Off |                  Off |
| 33%   37C    P2    58W / 230W |   7200MiB / 16125MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000     On   | 00000000:1E:00.0 Off |                  Off |
| 33%   35C    P2    57W / 230W |   7200MiB / 16125MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 5000     On   | 00000000:3D:00.0 Off |                  Off |
| 33%   31C    P2    49W / 230W |   7200MiB / 16125MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 5000     On   | 00000000:3F:00.0 Off |                  Off |
| 33%   34C    P2    50W / 230W |   7200MiB / 16125MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 5000     On   | 00000000:40:00.0 Off |                  Off |
|  0%   40C    P2    55W / 230W |   7200MiB / 16125MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 5000     On   | 00000000:41:00.0 Off |                  Off |
| 33%   34C    P2    58W / 230W |   7200MiB / 16125MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     22213      C   DlrmTrainer:0                   10059MiB |
|    0   N/A  N/A     22214      C   DlrmTrainer:1                     853MiB |
|    0   N/A  N/A     22215      C   DlrmTrainer:2                     873MiB |
|    0   N/A  N/A     22216      C   DlrmTrainer:3                     799MiB |
|    0   N/A  N/A     22217      C   DlrmTrainer:4                    1057MiB |
|    0   N/A  N/A     22218      C   DlrmTrainer:5                     773MiB |
|    0   N/A  N/A     22219      C   DlrmTrainer:6                     843MiB |
|    0   N/A  N/A     22220      C   DlrmTrainer:7                     765MiB |
|    1   N/A  N/A     22214      C   DlrmTrainer:1                    7177MiB |
|    2   N/A  N/A     22215      C   DlrmTrainer:2                    7177MiB |
|    3   N/A  N/A     22216      C   DlrmTrainer:3                    7177MiB |
|    4   N/A  N/A     22217      C   DlrmTrainer:4                    7177MiB |
|    5   N/A  N/A     22218      C   DlrmTrainer:5                    7177MiB |
|    6   N/A  N/A     22219      C   DlrmTrainer:6                    7177MiB |
|    7   N/A  N/A     22220      C   DlrmTrainer:7                    7177MiB |
+-----------------------------------------------------------------------------+

Why are processes 1-7 also allocating memory on GPU 0?

This does not happen when I use NCCL as my backend.

My guess is that with the Gloo backend, the data from the other GPUs is somehow implicitly gathered on GPU 0 first and then transferred to the CPU.
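
One way to test this guess, sketched here under assumptions: hide all but one GPU from each process by setting CUDA_VISIBLE_DEVICES before anything initializes CUDA. If the extra entries on GPU 0 disappear, they were most likely CUDA contexts that ranks 1-7 opened on the default device, rather than data that has to live there. The helper name pin_process_to_gpu is hypothetical, and the sketch assumes one process per GPU with rank equal to the physical GPU index, as in my setup.

```python
import os
from typing import MutableMapping, Optional


def pin_process_to_gpu(
    rank: int, env: Optional[MutableMapping[str, str]] = None
) -> MutableMapping[str, str]:
    """Expose only physical GPU `rank` to the current process.

    Hypothetical helper. It must run before the first CUDA call in the
    process, because the variable is read when CUDA is initialized.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env


# Intended use at the very top of each worker (assumption: single node):
#
#   pin_process_to_gpu(rank)
#   import torch
#   torch.cuda.set_device(0)  # the only visible GPU is now index 0
```

With this in place, each process should only be able to create a context on its own GPU, so nvidia-smi would show whether the footprint on GPU 0 was avoidable.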

Please note that, as also summarized in Distributed communication package - torch.distributed — PyTorch 1.8.0 documentation, this is the rule of thumb:

  • Use the NCCL backend for distributed GPU training.
  • Use the Gloo backend for distributed CPU training.

Therefore, we should always favor NCCL on GPUs.
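
A minimal sketch of applying that rule of thumb when initializing the process group. The function names pick_backend and init_distributed are hypothetical, and the sketch assumes a single node where the rank equals the GPU index, with MASTER_ADDR and MASTER_PORT already set in the environment for the default env:// initialization.

```python
def pick_backend(use_gpu: bool) -> str:
    """Rule of thumb from the docs: NCCL for GPU training, Gloo for CPU training."""
    return "nccl" if use_gpu else "gloo"


def init_distributed(rank: int, world_size: int) -> str:
    """Initialize torch.distributed with the backend chosen above.

    Assumes one process per GPU on a single node (rank == GPU index).
    Returns the backend name that was used.
    """
    # Imported lazily so the backend choice can be inspected without PyTorch.
    import torch
    import torch.distributed as dist

    use_gpu = torch.cuda.is_available()
    if use_gpu:
        # Select this process's GPU before any tensors are created on CUDA.
        torch.cuda.set_device(rank)

    backend = pick_backend(use_gpu)
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return backend
```

Each worker would call init_distributed(rank, world_size) once at startup, before building the model.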