Hello,
I am trying to get a multi-GPU training session running, but the process keeps hanging on dist.init_process_group(). There is no error or any kind of INFO message, even with NCCL_DEBUG=INFO. Any insights on how to fix this?
Fri Sep 6 15:27:48 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:18:00.0 Off |                  N/A |
| 38%   54C    P8    30W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:3B:00.0 Off |                  N/A |
| 44%   61C    P8    40W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:86:00.0 Off |                  N/A |
| 25%   44C    P8    10W / 250W |      2MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:AF:00.0  On |                  N/A |
| 25%   45C    P5    20W / 250W |   1242MiB / 12193MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
torch==0.4.1 (also tried with 1.1)
torchaudio==0.2
torchsummary==1.5.1
torchtext==0.4.0
torchvision==0.4.0
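
Since the hang produces no output at all, one thing that may be worth ruling out first is the rendezvous setup. With the default env:// init method, dist.init_process_group() reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment and blocks until WORLD_SIZE processes have joined, so a missing variable or a WORLD_SIZE larger than the number of processes actually launched can look like a silent hang. The post does not say which init method is in use, so this is only a sketch under that assumption; check_dist_env is a hypothetical helper name, not part of PyTorch.

```python
import os

# Environment variables read by the "env://" init method.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_dist_env(env=None):
    """Return a list of problems found; an empty list means the
    variables look consistent. Hypothetical helper, assumes env:// init."""
    env = os.environ if env is None else env

    # Report every missing (or empty) variable at once.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if problems:
        return problems

    try:
        rank = int(env["RANK"])
        world_size = int(env["WORLD_SIZE"])
        port = int(env["MASTER_PORT"])
    except ValueError:
        return ["RANK, WORLD_SIZE and MASTER_PORT must all be integers"]

    if world_size < 1:
        problems.append("WORLD_SIZE must be at least 1")
    elif not 0 <= rank < world_size:
        problems.append(f"RANK {rank} is outside [0, {world_size})")
    if not 0 < port < 65536:
        problems.append(f"MASTER_PORT {port} is not a valid TCP port")
    return problems


if __name__ == "__main__":
    # Run this in each worker just before dist.init_process_group().
    for problem in check_dist_env():
        print("rendezvous problem:", problem)
```

If every rank reports no problems and WORLD_SIZE matches the number of processes that were actually started, the cause is probably elsewhere (for example the backend or the network interface), and this check will not catch it.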