DDP on Azure Cluster using MLOps Pipeline -- [E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 13 failed to pass monitoredBarrier in 1800000 ms

I am trying to use DDP for a training script that I run as a job using MLOps Pipeline Templates. The backend is NCCL.

2023/05/04 19:35:32 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
CPython
3.8.13
uname_result(system='Linux', node='e6bf0f05f6f44321b97ab2dff8cda44c000000', release='5.0.0-1036-azure', version='#38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020', machine='x86_64', processor='x86_64')
training script path:  /mnt/azureml/cr/j/05ab7c96066d41ea841fe18077f16b59/exe/wd
start: 19:35:32.127799
manual seed set to 4395
opt.checkpoints = /mnt/azureml/cr/j/05ab7c96066d41ea841fe18077f16b59/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 0 and local_rank is 0
is_distributed is True and batch_size is 2
os.getpid() is 39 and initializing process group with {'MASTER_ADDR': '10.0.0.4', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '16'}
device is cuda:0
MLflow version: 1.25.1
Tracking URI: my_URI
Artifact URI: my_URI
load data
train data size:  246000
training data len:  246000
batch size is:  2
training data: 7688 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.

  0%|          | 0.00/548M [00:00<?, ?B/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 548M/548M [00:05<00:00, 99.0MB/s]
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO P2P plugin IBext
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0

e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] ibvwrap.c:66 NCCL WARN Call to ibv_open_device failed

e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] p2p_plugin.c:190 NCCL WARN NET/IB : Unable to open device mlx4_0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NET/IB : No device found.
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci13ff:00/13ff:00:00.0/../max_link_speed, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci13ff:00/13ff:00:00.0/../max_link_width, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci2ad7:00/2ad7:00:00.0/../max_link_speed, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci2ad7:00/2ad7:00:00.0/../max_link_width, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci373d:00/373d:00:00.0/../max_link_speed, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci373d:00/373d:00:00.0/../max_link_width, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci529e:00/529e:00:00.0/../max_link_speed, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci529e:00/529e:00:00.0/../max_link_width, ignoring
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3adc-6e3b-000d-3adc-6e3b000d3adc is not a PCI device (vmbus). Attaching to first CPU
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Attribute coll of node net not found
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO CPU/0 (1/1/1)
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO + PCI[5000.0] - NIC/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO + PCI[12.0] - GPU/13FF00000 (0)
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO + PCI[12.0] - GPU/2AD700000 (1)
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO + PCI[12.0] - GPU/373D00000 (2)
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO + PCI[12.0] - GPU/529E00000 (3)
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO ==========================================
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO GPU/13FF00000 :GPU/13FF00000 (0/5000.000000/LOC) GPU/2AD700000 (2/12.000000/PHB) GPU/373D00000 (2/12.000000/PHB) GPU/529E00000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO GPU/2AD700000 :GPU/13FF00000 (2/12.000000/PHB) GPU/2AD700000 (0/5000.000000/LOC) GPU/373D00000 (2/12.000000/PHB) GPU/529E00000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO GPU/373D00000 :GPU/13FF00000 (2/12.000000/PHB) GPU/2AD700000 (2/12.000000/PHB) GPU/373D00000 (0/5000.000000/LOC) GPU/529E00000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO GPU/529E00000 :GPU/13FF00000 (2/12.000000/PHB) GPU/2AD700000 (2/12.000000/PHB) GPU/373D00000 (2/12.000000/PHB) GPU/529E00000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO NET/0 :GPU/13FF00000 (3/5.000000/PHB) GPU/2AD700000 (3/5.000000/PHB) GPU/373D00000 (3/5.000000/PHB) GPU/529E00000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/8/-1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Tree 1 : 4 -> 0 -> 1/-1/-1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Ring 00 : 15 -> 0 -> 1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Ring 01 : 15 -> 0 -> 1
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->4
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 00 : 15[85f000000] -> 0[13ff00000] [receive] via NET/Socket/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 01 : 15[85f000000] -> 0[13ff00000] [receive] via NET/Socket/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 00 : 0[13ff00000] -> 1[2ad700000] via direct shared memory
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 01 : 0[13ff00000] -> 1[2ad700000] via direct shared memory
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Connected all rings
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 01 : 0[13ff00000] -> 4[1f4000000] [send] via NET/Socket/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 00 : 8[1e5b00000] -> 0[13ff00000] [receive] via NET/Socket/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 00 : 0[13ff00000] -> 8[1e5b00000] [send] via NET/Socket/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Channel 01 : 4[1f4000000] -> 0[13ff00000] [receive] via NET/Socket/0
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO Connected all trees
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:396 [0] NCCL INFO comm 0x14d3a4001240 rank 0 nranks 16 cudaDev 0 busId 13ff00000 - Init COMPLETE
e6bf0f05f6f44321b97ab2dff8cda44c000000:39:39 [0] NCCL INFO Launch mode Parallel

[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 4 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 5 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 6 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 7 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 8 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 9 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 10 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 11 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 12 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 13 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 14 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 15 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:136] Ranks 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 failed to pass monitoredBarrier in 1800000 ms
Traceback (most recent call last):
  File "train.py", line 1569, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1528, in _runnetwork
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Ranks 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 failed to pass monitoredBarrier in 1800000 ms

I see this error in the Job Overview in the Pipeline menu:

4 processes failed. Please see below for their details:
Execution failed. User process 'Rank 12' exited with status code 1. Please check log file 'user_logs/std_log_process_12.txt' for error details. Error: Traceback (most recent call last):
  File "train.py", line 1569, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1504, in _runnetwork
    output_belief, output_affinities = net(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 962, in forward
    inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1086, in to_kwargs
    inputs = self._recursive_to(inputs, device_id) if inputs else []
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1080, in _recursive_to
    res = to_map(inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1059, in to_map
    return list(zip(*map(to_map, obj)))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1044, in to_map
    stream = _get_stream(target_gpu)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 122, in _get_stream
    if _streams[device] is None:
IndexError: list index out of range

Execution failed. User process 'Rank 13' exited with status code 1. Please check log file 'user_logs/std_log_process_13.txt' for error details. Error: Traceback (most recent call last):
  File "train.py", line 1569, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1504, in _runnetwork
    output_belief, output_affinities = net(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 962, in forward
    inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1086, in to_kwargs
    inputs = self._recursive_to(inputs, device_id) if inputs else []
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1080, in _recursive_to
    res = to_map(inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1059, in to_map
    return list(zip(*map(to_map, obj)))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1044, in to_map
    stream = _get_stream(target_gpu)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 122, in _get_stream
    if _streams[device] is None:
IndexError: list index out of range

Execution failed. User process 'Rank 14' exited with s

My cluster has 4 nodes and each node has 4 GPUs.

I am not sure which one of these is exactly correct, but all three give errors:

# net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[local_rank % torch.cuda.device_count()],
#                                                 output_device=local_rank % torch.cuda.device_count())
net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[torch.distributed.get_rank()],
                                                output_device=torch.distributed.get_rank())
#net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[local_rank]).to(local_rank) #which one is correct

Also, as for the URL, I am not sure which one to use exactly; each throws a different error. I won’t know ahead of time which node is my master node since I am using MLOps pipeline templates:

    
dist_url = "env://" # default
# dist_url = "auto"

is_distributed = world_size > 1

if is_distributed:
    batch_size = opt.batch_size // world_size
    batch_size = max(batch_size, 1)
else:
    batch_size = opt.batch_size
    
print("is_distributed is {} and batch_size is {}".format(is_distributed, batch_size))
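(With batch_size: 32 from the pipeline inputs below and a world size of 16, this works out to 32 // 16 = 2 per rank, which matches the "batch_size is 2" line in the log above.)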


env_dict = {
    key: os.environ[key]
    for key in ("MASTER_ADDR", "MASTER_PORT","LOCAL_RANK", "RANK", "WORLD_SIZE")
}
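
For context, this is roughly how those values feed into init_process_group in my script (a simplified sketch, not the exact train.py; with init_method="env://", torch.distributed reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment):

import os
import torch
import torch.distributed as dist

world_size = int(os.environ["WORLD_SIZE"])
global_rank = int(os.environ["RANK"])       # global rank, 0..15 here
local_rank = int(os.environ["LOCAL_RANK"])  # per-node rank, 0..3 here

# env:// picks up MASTER_ADDR / MASTER_PORT from the environment set by the launcher.
dist.init_process_group(backend="nccl", init_method="env://",
                        world_size=world_size, rank=global_rank)

# Each process drives exactly one GPU on its node.
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")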



Here is the command in train.yaml, inside the mlops --> azureml --> train folder.

command: >-
  python train.py
  --data ${{inputs.train_data}}
  $[[--batch_size ${{inputs.batch_size}}]]
  $[[--num_workers ${{inputs.num_workers}}]]
  $[[--prefetch_factor ${{inputs.prefetch_factor}}]]
  $[[--persistent_workers ${{inputs.persistent_workers}}]]
  $[[--pin_memory ${{inputs.pin_memory}}]]
  $[[--non_blocking ${{inputs.non_blocking}}]]
  $[[--model_arch ${{inputs.model_arch}}]]
  $[[--model_arch_pretrained ${{inputs.model_arch_pretrained}}]]
  $[[--num_epochs ${{inputs.num_epochs}}]]
  $[[--learning_rate ${{inputs.learning_rate}}]]
  --checkpoints ${{outputs.checkpoints}}
  $[[--register_model_as ${{inputs.register_model_as}}]]
  --enable_profiling ${{inputs.enable_profiling}}
  $[[--multiprocessing_sharing_strategy ${{inputs.multiprocessing_sharing_strategy}}]]


Can you please guide me on how to fix this?

The problem is that you’re passing distributed.get_rank() as the device to DDP:

DistributedDataParallel(
   net,
   device_ids=[distributed.get_rank()],
   output_device=distributed.get_rank())

get_rank() returns the global rank of each worker, which on 4 nodes with 4 GPUs each means values between 0 and 15.
Device numbers are 0-3 on every node, and that’s the issue.

An easy fix is to use distributed.get_rank() % torch.cuda.device_count()
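
Roughly, a sketch of what I mean (the tiny Linear model is just a placeholder for your network, and this assumes the process group is already initialized):

import os
import torch
import torch.distributed as dist

# Map the global rank (0..15 on 4 nodes x 4 GPUs) onto the local devices (0..3).
local_rank = dist.get_rank() % torch.cuda.device_count()
# Equivalent if your launcher exports LOCAL_RANK:
# local_rank = int(os.environ["LOCAL_RANK"])

torch.cuda.set_device(local_rank)
net = torch.nn.Linear(10, 10).to(local_rank)  # placeholder for your actual model
net = torch.nn.parallel.DistributedDataParallel(
    net, device_ids=[local_rank], output_device=local_rank)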


Thanks a lot for your response, Rodrigo.
The job is running (no error yet) after I used the code you suggested:

net = torch.nn.parallel.DistributedDataParallel(
    net,
    device_ids=[torch.distributed.get_rank() % torch.cuda.device_count()],
    output_device=torch.distributed.get_rank() % torch.cuda.device_count())

Do you know why the Azure cluster with 4 nodes is showing only 1 node as an active run and the others as busy? Shouldn’t all nodes be active?

Also, from the PyTorch DDP side, is there a preference between acquiring a cluster whose VM SKU has an "r" in the name (RDMA/InfiniBand-capable) or one without it?

Standard_NC24r vs Standard_NC24

The job failed.

2023/05/05 14:38:55 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
CPython
3.8.13
uname_result(system='Linux', node='bc5e5260f29343048019567c5f2d8dd0000001', release='5.0.0-1036-azure', version='#38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020', machine='x86_64', processor='x86_64')
training script path:  /mnt/azureml/cr/j/e4e8b049b9d3446fbb7b7b7d6bcf081d/exe/wd
start: 14:38:55.579723
manual seed set to 7071
opt.checkpoints = /mnt/azureml/cr/j/e4e8b049b9d3446fbb7b7b7d6bcf081d/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 0 and local_rank is 0
is_distributed is True and batch_size is 2
os.getpid() is 39 and initializing process group with {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '16'}
device is cuda:0
MLflow version: 1.25.1
Tracking URI: azureml://eastus2.api.azureml.ms/mlflow/v1.0/subscriptions/9be1367a-bcc9-4275-8b3d-a0469f4119fa/resourceGroups/TMH-ARD-tmhml03-01-dev/providers/Microsoft.MachineLearningServices/workspaces/mlw-tmhml03-01-dev
Artifact URI: azureml://experiments/dev_DOPE_FAT_test2/runs/03fc5ff6-bf3b-4b8e-b5fb-a3ffe3218e04/artifacts
load data
train data size:  246000
training data len:  246000
batch size is:  2
training data: 7688 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.

  0%|          | 0.00/548M [00:00<?, ?B/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 548M/548M [00:03<00:00, 160MB/s]
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 4 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 5 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 6 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 7 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:136] Ranks 4, 5, 6, 7 failed to pass monitoredBarrier in 1800000 ms

Traceback (most recent call last):
  File "train.py", line 1474, in <module>
    net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[torch.distributed.get_rank() % torch.cuda.device_count()],
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: Ranks 4, 5, 6, 7 failed to pass monitoredBarrier in 1800000 ms

I also ran it with another DDP variant (the code is shown after the log below) and still got the same error:

2023/05/05 15:07:10 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
CPython
3.8.13
uname_result(system='Linux', node='0d04b15c28814731a1ebc312d3fcb995000000', release='5.0.0-1036-azure', version='#38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020', machine='x86_64', processor='x86_64')
training script path:  /mnt/azureml/cr/j/fe9461a9ae164e0aab83f2dd41ca43e7/exe/wd
start: 15:07:10.743225
manual seed set to 453
opt.checkpoints = /mnt/azureml/cr/j/fe9461a9ae164e0aab83f2dd41ca43e7/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 0 and local_rank is 0
is_distributed is True and batch_size is 2
os.getpid() is 38 and initializing process group with {'MASTER_ADDR': '10.0.0.4', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '16'}
device is cuda:0
MLflow version: 1.25.1
Tracking URI: azureml://eastus2.api.azureml.ms/mlflow/v1.0/subscriptions/9be1367a-bcc9-4275-8b3d-a0469f4119fa/resourceGroups/TMH-ARD-tmhml03-01-dev/providers/Microsoft.MachineLearningServices/workspaces/mlw-tmhml03-01-dev
Artifact URI: azureml://experiments/dev_DOPE_FAT_test2/runs/2f08abdf-57f0-4bad-8929-14d01b4b2d38/artifacts
load data
train data size:  246000
training data len:  246000
batch size is:  2
training data: 7688 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.

  0%|          | 0.00/548M [00:00<?, ?B/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 548M/548M [00:02<00:00, 237MB/s]
[E ProcessGroupGloo.cpp:2789] [Rank 0]: Rank 10 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:136] Ranks 10 failed to pass monitoredBarrier in 1800000 ms

Traceback (most recent call last):
  File "train.py", line 1468, in <module>
    net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[local_rank], output_device=local_rank)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: Ranks 10 failed to pass monitoredBarrier in 1800000 ms

torch.distributed.init_process_group(
    backend=distributed_backend,
    init_method=dist_url,
    world_size=world_size,
    rank=world_rank  # so should it be RANK or WORLD_RANK?
)
local_rank = int(os.environ["LOCAL_RANK"])  # os.environ returns strings
self_is_main_node = False
net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[local_rank], output_device=local_rank)

Those lines are in train.py.

In the pipeline.yaml file, I have these flags. Are these flags problematic?

jobs:
  train:
    type: command
    component: file:train.yaml

    compute: azureml:pose-4node-NC24r #16GPUs w r
    resources:
      instance_count: 4    # number of nodes
    distribution:
      type: pytorch
      process_count_per_instance: 4 # number of gpus
      # process_count_per_instance: 1 # set to 1 for testing purposes

    # NOTE: set env var if needed
    environment_variables:
      NCCL_DEBUG: "INFO" # adjusts the level of info from NCCL tests

      # NCCL_TOPO_FILE: "/opt/microsoft/ndv4-topo.xml" # Use specific topology file for A100

      # NCCL_IB_PCI_RELAXED_ORDERING: "1" # Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
      NCCL_IB_DISABLE: "1" # force disable infiniband (if set to "1")
      # NCCL_NET_PLUGIN: "none" # to force NET/Plugin off (no rdma/sharp plugin at all)
      # NCCL_NET: "Socket" # to force node-to-node comm to use Socket (slow)
      NCCL_SOCKET_IFNAME: "lo" # to force Socket comm to use eth0 (use NCCL_NET=Socket)

      # UCX_IB_PCI_RELAXED_ORDERING: "on"
      # UCX_TLS: "tcp"
      # UCX_NET_DEVICES: "eth0" # if you have Error: Failed to resolve UCX endpoint...

      CUDA_DEVICE_ORDER: "PCI_BUS_ID" # ordering of gpus

      TORCH_DISTRIBUTED_DEBUG: "DETAIL"

    inputs:
      # data inputs
      train_data: ${{parent.inputs.training_data}}
  

      # data loading
      # batch_size: 256
      batch_size: 32
      manualseed: 42
      num_workers: 5
      prefetch_factor: 4
      persistent_workers: true
      pin_memory: true
      non_blocking: false

      # model
      model_arch: "vgg16"
      model_arch_pretrained: true

      # training
      num_epochs: 1 # for testing purposes
      learning_rate: 0.0001

This error means that not all ranks constructed the DDP model in a timely fashion at the beginning.

DDP construction requires all participating ranks to do it in an SPMD fashion, i.e., all at the same time and in the same order.
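
As an aside, the 1800000 ms in the error is just the default 30-minute process-group timeout; if some ranks are expected to be slower to reach DDP construction (e.g. a cold pretrained-weight cache), init_process_group lets you raise it. A sketch, with an arbitrary 60-minute value:

import datetime
import torch.distributed as dist

# Default is datetime.timedelta(minutes=30) == 1800000 ms.
dist.init_process_group(backend="nccl", init_method="env://",
                        timeout=datetime.timedelta(minutes=60))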


Could you please propose a fix for this?

Fixed by using dist.barrier() right after init_process_group().

I had dist.barrier() elsewhere, but not there.
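
For anyone hitting the same thing, this is roughly where the barrier sits now (a simplified sketch, not my exact train.py; the vgg19 line is just a stand-in for the model whose pretrained weights are being downloaded in the logs above):

import os
import torch
import torch.distributed as dist
import torchvision

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl", init_method="env://")

# Sync all ranks right after init, before any rank starts long, uneven work
# (pretrained-weight downloads, data loading, ...).
dist.barrier()

net = torchvision.models.vgg19(pretrained=True).to(local_rank)
net = torch.nn.parallel.DistributedDataParallel(
    net, device_ids=[local_rank], output_device=local_rank)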