Hi there,
I am noticing some peculiar behavior when training with gloo as my backend. I am running on an 8-GPU node, and it seems like the processes with ranks 1 through 7 are leaving a memory footprint on GPU 0 when they shouldn't. I am setting the device in each process with torch.cuda.set_device(rank). This is what the output from nvidia-smi looks like:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 5000 On | 00000000:1A:00.0 Off | Off |
| 33% 30C P2 53W / 230W | 16115MiB / 16125MiB | 14% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 5000 On | 00000000:1C:00.0 Off | Off |
| 33% 35C P2 61W / 230W | 7200MiB / 16125MiB | 17% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Quadro RTX 5000 On | 00000000:1D:00.0 Off | Off |
| 33% 37C P2 58W / 230W | 7200MiB / 16125MiB | 16% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Quadro RTX 5000 On | 00000000:1E:00.0 Off | Off |
| 33% 35C P2 57W / 230W | 7200MiB / 16125MiB | 16% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Quadro RTX 5000 On | 00000000:3D:00.0 Off | Off |
| 33% 31C P2 49W / 230W | 7200MiB / 16125MiB | 15% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Quadro RTX 5000 On | 00000000:3F:00.0 Off | Off |
| 33% 34C P2 50W / 230W | 7200MiB / 16125MiB | 16% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Quadro RTX 5000 On | 00000000:40:00.0 Off | Off |
| 0% 40C P2 55W / 230W | 7200MiB / 16125MiB | 16% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Quadro RTX 5000 On | 00000000:41:00.0 Off | Off |
| 33% 34C P2 58W / 230W | 7200MiB / 16125MiB | 16% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 22213 C DlrmTrainer:0 10059MiB |
| 0 N/A N/A 22214 C DlrmTrainer:1 853MiB |
| 0 N/A N/A 22215 C DlrmTrainer:2 873MiB |
| 0 N/A N/A 22216 C DlrmTrainer:3 799MiB |
| 0 N/A N/A 22217 C DlrmTrainer:4 1057MiB |
| 0 N/A N/A 22218 C DlrmTrainer:5 773MiB |
| 0 N/A N/A 22219 C DlrmTrainer:6 843MiB |
| 0 N/A N/A 22220 C DlrmTrainer:7 765MiB |
| 1 N/A N/A 22214 C DlrmTrainer:1 7177MiB |
| 2 N/A N/A 22215 C DlrmTrainer:2 7177MiB |
| 3 N/A N/A 22216 C DlrmTrainer:3 7177MiB |
| 4 N/A N/A 22217 C DlrmTrainer:4 7177MiB |
| 5 N/A N/A 22218 C DlrmTrainer:5 7177MiB |
| 6 N/A N/A 22219 C DlrmTrainer:6 7177MiB |
| 7 N/A N/A 22220 C DlrmTrainer:7 7177MiB |
+-----------------------------------------------------------------------------+
Why are processes 1-7 also allocating memory on GPU 0?
This does not happen when I use NCCL as my backend.
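For reference, here is a simplified, self-contained sketch of how each process is set up. The toy linear model, mp.spawn launcher, and addresses are placeholders standing in for the actual DLRM trainer and launch script; the gloo backend and torch.cuda.set_device(rank) call match what I am actually doing.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    # Pin this process to its own GPU before doing any CUDA work.
    torch.cuda.set_device(rank)

    # Placeholder rendezvous settings for the sketch.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Using gloo here; with "nccl" the extra allocations on GPU 0 do not appear.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    device = torch.device("cuda", rank)
    model = torch.nn.Linear(128, 1).to(device)  # stand-in for the real DLRM model
    ddp_model = DDP(model, device_ids=[rank])

    # ... training loop omitted ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 8
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)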