Azure MLOps Pipelining -- NCCL WARN [Rem Allocator] Allocation failed & include/alloc.h:48

Disclaimer: I am cross-posting this here and on the NCCL side, since I am not sure whether my PyTorch code is causing this or the problem is with NCCL.

I am training DOPE (Deep Object Pose Estimation) on an Azure cluster. I made it distributed myself with PyTorch DistributedDataParallel, starting from the original train.py and going from there. Here's the gist for that; I am not sure whether I have applied DDP fully correctly.

Original train.py code: Deep_Object_Pose/train.py at master · NVlabs/Deep_Object_Pose · GitHub
DDP version of train.py modified by me: DDP version of DOPE train.py · GitHub

That said, I am using the Azure MLOps Pipeline Templates for this task, on an Azure GPU cluster with 4 nodes and 4 K80 GPUs per node. Checking host-tools.log, I can see the GPUs still have plenty of memory free when I get the NCCL mem alloc failed warning, and even while that message is being printed the training keeps moving forward. Oddly, it happens even when I set the batch size per GPU to 1.
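
To double-check that memory claim at runtime rather than only from host-tools.log, I can call something like the helper below from the training loop. This is just a minimal sketch using the standard torch.cuda memory counters; log_gpu_memory is my own name, and the device index is whatever local_rank each process owns.

import torch

def log_gpu_memory(local_rank: int, tag: str = "") -> None:
    """Print per-process GPU memory usage so it can be compared against
    the 'Allocation failed' warnings in the NCCL log."""
    allocated = torch.cuda.memory_allocated(local_rank) / 1024**2   # MiB held by live tensors
    reserved = torch.cuda.memory_reserved(local_rank) / 1024**2     # MiB held by the caching allocator
    total = torch.cuda.get_device_properties(local_rank).total_memory / 1024**2
    print(f"[rank {local_rank}] {tag} allocated={allocated:.0f}MiB "
          f"reserved={reserved:.0f}MiB total={total:.0f}MiB")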

I am using the entire FAT dataset from NVIDIA for training here, and the object of interest is the cracker box.

I cancelled the job, but here's the log from before I did so:

2023/05/16 19:24:26 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO P2P plugin IBext
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NET/IB : No device found.
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a02-42ea-000d-3a02-42ea000d3a02 is not a PCI device (vmbus). Attaching to first CPU
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Attribute coll of node net not found
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO CPU/0 (1/1/1)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[5000.0] - NIC/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/100000 (0)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/200000 (1)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/300000 (2)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/400000 (3)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO ==========================================
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/8/-1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Tree 1 : 4 -> 0 -> 1/-1/-1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Ring 00 : 15 -> 0 -> 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Ring 01 : 15 -> 0 -> 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->4
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 15[400000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 15[400000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 0[100000] -> 1[200000] via direct shared memory
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 0[100000] -> 1[200000] via direct shared memory
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Connected all rings
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 0[100000] -> 4[100000] [send] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 8[100000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 0[100000] -> 8[100000] [send] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 4[100000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Connected all trees
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO comm 0x14fc50001240 rank 0 nranks 16 cudaDev 0 busId 100000 - Init COMPLETE
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Launch mode Parallel
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
CPython
3.8.13
uname_result(system='Linux', node='696b7901f9044f23b786795c0ef7257b000000', release='5.15.0-1029-azure', version='#36~20.04.1-Ubuntu SMP Tue Dec 6 17:00:26 UTC 2022', machine='x86_64', processor='x86_64')
training script path:  /mnt/azureml/cr/j/29c1643998874ef7b6bfd57858b8c0ea/exe/wd
start: 19:24:26.818894
manual seed set to 4646
opt.checkpoints = /mnt/azureml/cr/j/29c1643998874ef7b6bfd57858b8c0ea/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 0 and local_rank is 0
is_distributed is True and batch_size is 1
os.getpid() is 39 and initializing process group with {'MASTER_ADDR': '10.0.0.4', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '16'}
device is cuda:0
MLflow version: 1.25.1
Tracking URI: azureml:URI
Artifact URI: azureml:URI
load data
train data size:  246000
training data len:  246000
batch size is:  1
training data: 15375 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.

  0%|          | 0.00/548M [00:00<?, ?B/s]
  
100%|██████████| 548M/548M [00:02<00:00, 259MB/s]

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.4<52638>
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:445 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:457 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:229 -> 2

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)



696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)
Train Epoch: 1 [0/246000 (0%)]	Loss: 0.036680415272713
epoch is: 1 and train loss is 0.03668041527271271
...
epoch is: 1 and train loss is 3.1753609164297814e-06
Train Epoch: 1 [2700/246000 (18%)]	Loss: 0.000002425569619
epoch is: 1 and train loss is 2.4255696189356968e-06
...
epoch is: 3 and train loss is 1.1219314899335586e-07
Train Epoch: 3 [5800/246000 (38%)]	Loss: 0.006392419338226

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.4<44028>
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:445 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:457 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:229 -> 2

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)
epoch is: 3 and train loss is 0.006392419338226318
Train Epoch: 3 [5900/246000 (38%)]	Loss: 0.000000082420129
...
epoch is: 3 and train loss is 1.4848103546682978e-07
Train Epoch: 3 [13100/246000 (85%)]	Loss: 0.000000773773365

Here’s my Dockerfile:

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
FROM nvcr.io/nvidia/pytorch:22.04-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
    MPI=1 MPI_HOME=/opt/hpcx/ompi \
    NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt

# RUN python -m pip install   azureml-defaults==1.41.0 \
#     mlflow==1.25.1 \
#     azureml-mlflow==1.41.0 \
#     transformers==4.18.0 \
#     psutil==5.9.0

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
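
To confirm that these ENV settings actually reach the training process on the compute nodes, I also dump the NCCL-related variables at the top of my script. This is just a quick sanity check, nothing Azure-specific:

import os

# Print every NCCL_* variable visible to the training process so I can
# verify the Dockerfile ENV values (NCCL_DEBUG, NCCL_SOCKET_IFNAME, ...)
# are what the job actually runs with.
for key, value in sorted(os.environ.items()):
    if key.startswith("NCCL_"):
        print(f"{key}={value}")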

And here's the requirements.txt I am using with the Azure MLOps Pipeline Templates:

albumentations==1.3.0
ConfigParser==5.3.0
horovod==0.27.0
matplotlib==3.7.0
numpy==1.24.2
nvisii==1.1.72
Pillow==9.4.0
profiling==0.1.3
psutil==5.9.0
pyquaternion==0.9.9
pyrealsense2==2.53.1.4623
pyrender==0.1.45
pyrr==0.10.3
PyYAML==6.0
scipy==1.10.1
seaborn==0.12.2
simplejson==3.18.4
tensorboardX==2.6
torchvision==0.12.0
torch==1.11.0
tqdm==4.64.1
opencv-python-headless==4.1.2.30
transformers==4.18.0
mlflow==1.25.1
azureml-mlflow==1.41.0
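
Since requirements.txt pins torch==1.11.0 on top of the NGC image, I also print which CUDA build of torch the job actually ends up with and which NCCL it bundles, to compare against the "NCCL version 2.10.3+cuda10.2" line in the log. A minimal check (the output obviously depends on the final environment):

import torch
import torch.cuda.nccl as nccl

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)   # CUDA version torch was compiled against
print("bundled NCCL version:", nccl.version())        # NCCL shipped with this torch build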

@fduwjj I get the same error when I use a single-node, multi-GPU Azure cluster (same K80 GPUs) with much simpler code and a smaller dataset: CIFAR-10 with a pretrained ResNet-50.

Here is my train.py:

import time
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import mlflow
import os
import datetime

import configparser
import logging
import argparse

from PIL import Image

import ssl
ssl._create_default_https_context = ssl._create_unverified_context


start_time = time.time()


print("MLflow version:", mlflow.__version__)
print("Tracking URI:", mlflow.get_tracking_uri())
print("Artifact URI:", mlflow.get_artifact_uri())

# Set the seed for reproducibility
torch.manual_seed(42)

# Set up the data loading parameters
batch_size = 128
num_epochs = 100
num_workers = 4
pin_memory = True

# Get the world size and rank to determine the process group
world_size = int(os.environ['WORLD_SIZE'])
world_rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])

print("World size:", world_size)
print("local rank is {} and world rank is {}".format(local_rank, world_rank))

is_distributed = world_size > 1

if is_distributed:
    batch_size = batch_size // world_size
    batch_size = max(batch_size, 1)

# Set the backend to NCCL for distributed training
dist.init_process_group(backend="nccl",
                        init_method="env://",
                        world_size=world_size,
                        rank=world_rank)

# Set the device to the current local rank
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)

dist.barrier()

# Define the transforms for the dataset
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
])

# Load the CIFAR-10 dataset

data_root = './data_' + str(world_rank)
train_dataset = torchvision.datasets.CIFAR10(root=data_root, train=True, download=True, transform=transform_train)
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=world_size, rank=world_rank, shuffle=True) if is_distributed else None
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=(train_sampler is None), num_workers=num_workers, pin_memory=pin_memory, sampler=train_sampler)

test_dataset = torchvision.datasets.CIFAR10(root=data_root, train=False, download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)

# Define the ResNet50 model
model = torchvision.models.resnet50(pretrained=True)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)

# Move the model to the GPU
model = model.to(device)

# Wrap the model with DistributedDataParallel
if is_distributed:
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model for the specified number of epochs
for epoch in range(num_epochs):
    running_loss = 0.0
    if is_distributed:
        # set_epoch() gives the DistributedSampler a new shuffle seed each
        # epoch, so every replica sees a different ordering instead of
        # repeating the epoch-0 ordering.
        train_sampler.set_epoch(epoch)
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

        optimizer.step()

        running_loss += loss.item()

    print('[Epoch %d] loss: %.3f' % (epoch + 1, running_loss))
    if world_rank == 0:
        # Log the loss and running loss as MLFlow metrics
        mlflow.log_metric("loss", loss.item())
        mlflow.log_metric("running loss", running_loss)

dist.barrier()
# Save the trained model
if world_rank == 0:
    checkpoints_path = "train_checkpoints"
    os.makedirs(checkpoints_path, exist_ok=True)
    torch.save(model.state_dict(), '{}/{}-{}.pth'.format(checkpoints_path, 'resnet50_cifar10', world_rank))
    mlflow.pytorch.log_model(model, "resnet50_cifar10_{}.pth".format(world_rank))
    # mlflow.log_artifact('{}/{}-{}.pth'.format(checkpoints_path, 'resnet50_cifar10', world_rank), artifact_path="model_state_dict")

# Evaluate the model on the test set and save inference on 6 random images
correct = 0
total = 0
with torch.no_grad():
    fig, axs = plt.subplots(2, 3, figsize=(8, 6), dpi=100)
    axs = axs.flatten()
    count = 0
    for data in test_loader:
        if count == 6:
            break
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

        # Save the inference on the 6 random images
        if count < 6:
            image = np.transpose(inputs[0].cpu().numpy(), (1, 2, 0))
            confidence = torch.softmax(outputs, dim=1)[0][predicted[0]].cpu().numpy()
            class_name = test_dataset.classes[predicted[0]]
            axs[count].imshow(image)
            axs[count].set_title(f'Class: {class_name}\nConfidence: {confidence:.2f}')
            axs[count].axis('off')
            count += 1

# Aggregate the test accuracy across all processes

# Use int64 so the summed counts cannot overflow (int8 caps at 127)
correct = torch.tensor(correct, dtype=torch.int64, device=device)
torch.distributed.all_reduce(correct, op=torch.distributed.ReduceOp.SUM)
total = torch.tensor(total, dtype=torch.int64, device=device)
torch.distributed.all_reduce(total, op=torch.distributed.ReduceOp.SUM)
# correct and total are now global sums, so their ratio is already the
# accuracy over all ranks; no extra division by world_size is needed.
test_accuracy = 100 * correct.item() / total.item()

print('Test accuracy: %.2f %%' % test_accuracy)

# Save the plot with the 6 random images and their predicted classes and prediction confidence
test_img_file_name = 'test_images_' + str(world_rank) + '.png'
plt.savefig(test_img_file_name)

# Log the test accuracy and elapsed time to MLflow
if world_rank == 0:
    mlflow.log_metric("test accuracy", test_accuracy)

end_time = time.time()
elapsed_time = end_time - start_time
print('Elapsed time: ', elapsed_time)
if world_rank == 0:
    mlflow.log_metric("elapsed time", elapsed_time)

# Save the plot with the 6 random images and their predicted classes and prediction confidence as an artifact in MLflow
image = Image.open(test_img_file_name)
image = image.convert('RGBA')
image_buffer = np.array(image)
image_buffer = image_buffer[:, :, [2, 1, 0, 3]]
image_buffer = np.ascontiguousarray(image_buffer)
artifact_file_name = "inference_on_test_images_" + str(world_rank) + ".png"
mlflow.log_image(image_buffer, artifact_file=artifact_file_name)

# End the MLflow run
if mlflow.active_run():
    mlflow.end_run()

dist.destroy_process_group()
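
For context, the script assumes the launcher injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and LOCAL_RANK (the Azure ML job does this for every rank). When I debug it as a single process outside the pipeline, I fake them like this; this is purely a local workaround, not how the pipeline launches it:

import os

# Minimal single-process environment so dist.init_process_group("nccl", init_method="env://")
# can still initialize; the real Azure ML job injects these for every rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")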

Here's the NCCL mem alloc failed message:

fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO P2P plugin IBext
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NET/IB : No device found.
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bdbd-f1de-6045-bdbd-f1de6045bdbd is not a PCI device (vmbus). Attaching to first CPU
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Attribute coll of node net not found
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO CPU/0 (1/1/1)
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO + PCI[5000.0] - NIC/0
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO + PCI[12.0] - GPU/100000 (0)
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO + PCI[12.0] - GPU/200000 (1)
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO + PCI[12.0] - GPU/300000 (2)
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO + PCI[12.0] - GPU/400000 (3)
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO ==========================================
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) 
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) 
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) 
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) 
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 10.000000/10.000000, type PHB/PIX, sameChannels 1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Channel 00/02 :    0   1   2   3
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Channel 01/02 :    0   1   2   3
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Channel 00 : 0[100000] -> 1[200000] via direct shared memory
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Channel 01 : 0[100000] -> 1[200000] via direct shared memory
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Connected all rings
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO Connected all trees
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
fec33820a43e42eaa326fabda0a0a7b3000001:37:208 [0] NCCL INFO comm 0x153eb0001240 rank 0 nranks 4 cudaDev 0 busId 100000 - Init COMPLETE
fec33820a43e42eaa326fabda0a0a7b3000001:37:37 [0] NCCL INFO Launch mode Parallel
MLflow version: 1.25.1
Tracking URI: azureml:URI
Artifact URI: azureml:URI
World size: 4
local rank is 0 and world rank is 0
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_0/cifar-10-python.tar.gz

  0%|          | 0/170498071 [00:00<?, ?it/s]
...
170499072it [00:07, 23447255.93it/s]                               Extracting ./data_0/cifar-10-python.tar.gz to ./data_0
Files already downloaded and verified

[Epoch 1] loss: 469.946
[Epoch 2] loss: 331.277
[Epoch 3] loss: 288.598
[Epoch 4] loss: 271.003
[Epoch 5] loss: 253.672
[Epoch 6] loss: 238.890
[Epoch 7] loss: 228.931
[Epoch 8] loss: 218.206
[Epoch 9] loss: 220.511
[Epoch 10] loss: 199.832
[Epoch 11] loss: 186.101
[Epoch 12] loss: 188.461
[Epoch 13] loss: 179.636
[Epoch 14] loss: 162.535
[Epoch 15] loss: 180.182
[Epoch 16] loss: 160.623
[Epoch 17] loss: 147.587
[Epoch 18] loss: 146.390
[Epoch 19] loss: 141.057
[Epoch 20] loss: 139.670
[Epoch 21] loss: 135.072
[Epoch 22] loss: 147.974
[Epoch 23] loss: 124.034
[Epoch 24] loss: 127.605
[Epoch 25] loss: 118.408
[Epoch 26] loss: 114.545
[Epoch 27] loss: 116.625
[Epoch 28] loss: 128.930
[Epoch 29] loss: 106.408
[Epoch 30] loss: 104.195
[Epoch 31] loss: 107.757
[Epoch 32] loss: 104.589
[Epoch 33] loss: 92.798
[Epoch 34] loss: 95.119
[Epoch 35] loss: 93.005
[Epoch 36] loss: 89.597
[Epoch 37] loss: 90.558
[Epoch 38] loss: 88.932
[Epoch 39] loss: 82.978
[Epoch 40] loss: 81.889
[Epoch 41] loss: 81.951
[Epoch 42] loss: 82.444
[Epoch 43] loss: 77.162
[Epoch 44] loss: 77.444
[Epoch 45] loss: 77.268

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.5<48088>
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO include/socket.h:445 -> 2
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO include/socket.h:457 -> 2
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:229 -> 2

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] NCCL INFO bootstrap.cc:231 -> 1

fec33820a43e42eaa326fabda0a0a7b3000001:37:213 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 84)
[Epoch 46] loss: 74.304
[Epoch 47] loss: 72.692
[Epoch 48] loss: 76.108
[Epoch 49] loss: 69.997
[Epoch 50] loss: 83.098
[Epoch 51] loss: 74.974
[Epoch 52] loss: 62.839
[Epoch 53] loss: 61.260
[Epoch 54] loss: 60.852
[Epoch 55] loss: 59.302
[Epoch 56] loss: 60.803
[Epoch 57] loss: 58.838