NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci7ea5:00/7ea5:00:00.0/../max_link_speed, ignoring

Is it OK to get this message, or should I fix it? What is causing it? I am using the NCCL backend with DistributedDataParallel to train across 4 nodes, each with 4 GPUs, on an Azure cluster using the MLOps pipeline templates.
Additionally, I get this other message, and I am not sure whether I should ignore it: 95394c9183b545dca20c8c4e54176b86000004:49:49 [3] ibvwrap.c:66 NCCL WARN Call to ibv_open_device failed
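Since the two messages have different severities (the first is logged as INFO, the second as WARN), here is a minimal sketch for splitting NCCL log lines by severity so the warnings stand out. The function name group_by_severity and the shortened "host" prefix in the sample lines are made up for illustration; the regular expression only assumes the "NCCL INFO" / "NCCL WARN" markers seen in the log below.

```python
import re
from collections import defaultdict

# Matches lines such as "host:49:49 [3] NCCL INFO ..." or
# "host:49:49 [3] ibvwrap.c:66 NCCL WARN ...".
NCCL_LINE = re.compile(r"NCCL (INFO|WARN) (.*)")


def group_by_severity(lines):
    """Return {"INFO": [...], "WARN": [...]} with unique messages in first-seen order."""
    groups = defaultdict(list)
    for line in lines:
        match = NCCL_LINE.search(line)
        if match is None:
            continue  # not an NCCL line (e.g. tqdm or MLflow output)
        severity, message = match.group(1), match.group(2).strip()
        if message not in groups[severity]:
            groups[severity].append(message)
    return dict(groups)


# Hypothetical sample; in practice, pass the lines of the job's log file.
sample = [
    "host:49:49 [3] ibvwrap.c:66 NCCL WARN Call to ibv_open_device failed",
    "host:49:49 [3] NCCL INFO NET/IB : No device found.",
    "host:49:49 [3] NCCL INFO Using network Socket",
    "train data size:  246000",
]
grouped = group_by_severity(sample)
for severity, messages in grouped.items():
    print(severity, len(messages))
```

Running it over the full log should leave only a handful of distinct WARN messages to look at, separate from the much longer list of INFO lines.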


2023/05/05 15:35:08 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
CPython
3.8.13
uname_result(system='Linux', node='95394c9183b545dca20c8c4e54176b86000004', release='5.0.0-1036-azure', version='#38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020', machine='x86_64', processor='x86_64')
training script path:  /mnt/azureml/cr/j/4eec0c64516140a0bc1f1a83f113dd28/exe/wd
start: 15:35:08.416369
manual seed set to 3566
opt.checkpoints = /mnt/azureml/cr/j/4eec0c64516140a0bc1f1a83f113dd28/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 15 and local_rank is 3
is_distributed is True and batch_size is 2
os.getpid() is 49 and initializing process group with {'MASTER_ADDR': '10.0.0.4', 'MASTER_PORT': '6105', 'LOCAL_RANK': '3', 'RANK': '15', 'WORLD_SIZE': '16'}
device is cuda:3
MLflow version: 1.25.1
Tracking URI: MY_URI
Artifact URI: MY_URI
load data
train data size:  246000
training data len:  246000
batch size is:  2
training data: 7688 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.

100%|██████████| 548M/548M [00:01<00:00, 293MB/s]
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.8<0>
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO P2P plugin IBext
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0

95394c9183b545dca20c8c4e54176b86000004:49:49 [3] ibvwrap.c:66 NCCL WARN Call to ibv_open_device failed

95394c9183b545dca20c8c4e54176b86000004:49:49 [3] p2p_plugin.c:190 NCCL WARN NET/IB : Unable to open device mlx4_0
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NET/IB : No device found.
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.8<0>
95394c9183b545dca20c8c4e54176b86000004:49:49 [3] NCCL INFO Using network Socket
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci7ea5:00/7ea5:00:00.0/../max_link_speed, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci7ea5:00/7ea5:00:00.0/../max_link_width, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci9fa2:00/9fa2:00:00.0/../max_link_speed, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci9fa2:00/9fa2:00:00.0/../max_link_width, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pcicb58:00/cb58:00:00.0/../max_link_speed, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pcicb58:00/cb58:00:00.0/../max_link_width, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pcie52e:00/e52e:00:00.0/../max_link_speed, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pcie52e:00/e52e:00:00.0/../max_link_width, ignoring
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3ade-d129-000d-3ade-d129000d3ade is not a PCI device (vmbus). Attaching to first CPU
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Attribute coll of node net not found
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO CPU/0 (1/1/1)
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO + PCI[5000.0] - NIC/0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO + PCI[12.0] - GPU/7EA500000 (12)
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO + PCI[12.0] - GPU/9FA200000 (13)
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO + PCI[12.0] - GPU/CB5800000 (14)
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO + PCI[12.0] - GPU/E52E00000 (15)
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO ==========================================
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO GPU/7EA500000 :GPU/7EA500000 (0/5000.000000/LOC) GPU/9FA200000 (2/12.000000/PHB) GPU/CB5800000 (2/12.000000/PHB) GPU/E52E00000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO GPU/9FA200000 :GPU/7EA500000 (2/12.000000/PHB) GPU/9FA200000 (0/5000.000000/LOC) GPU/CB5800000 (2/12.000000/PHB) GPU/E52E00000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO GPU/CB5800000 :GPU/7EA500000 (2/12.000000/PHB) GPU/9FA200000 (2/12.000000/PHB) GPU/CB5800000 (0/5000.000000/LOC) GPU/E52E00000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO GPU/E52E00000 :GPU/7EA500000 (2/12.000000/PHB) GPU/9FA200000 (2/12.000000/PHB) GPU/CB5800000 (2/12.000000/PHB) GPU/E52E00000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO NET/0 :GPU/7EA500000 (3/5.000000/PHB) GPU/9FA200000 (3/5.000000/PHB) GPU/CB5800000 (3/5.000000/PHB) GPU/E52E00000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO  0 : NET/0 GPU/12 GPU/13 GPU/14 GPU/15 NET/0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO  0 : NET/0 GPU/12 GPU/13 GPU/14 GPU/15 NET/0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Ring 00 : 14 -> 15 -> 0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Ring 01 : 14 -> 15 -> 0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Setting affinity for GPU 3 to 0fff
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Channel 00 : 15[e52e00000] -> 0[894400000] [send] via NET/Socket/0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Channel 01 : 15[e52e00000] -> 0[894400000] [send] via NET/Socket/0
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Connected all rings
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Channel 00 : 15[e52e00000] -> 14[cb5800000] via direct shared memory
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Channel 01 : 15[e52e00000] -> 14[cb5800000] via direct shared memory
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO Connected all trees
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
95394c9183b545dca20c8c4e54176b86000004:49:402 [3] NCCL INFO comm 0x154910001240 rank 15 nranks 16 cudaDev 3 busId e52e00000 - Init COMPLETE

Did you check the value of NCCL_TOPO_FILE?
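One way to check it from inside the training script is sketched below. It lists every NCCL_* variable the process sees (your log already shows NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE and NCCL_IB_PCI_RELAXED_ORDERING being picked up from the environment) and reports whether NCCL_TOPO_FILE is set and points to an existing file. The helper name nccl_env_report is made up for this example, and I am assuming you can run a few lines on each node before the process group is initialized.

```python
import os


def nccl_env_report(environ=os.environ):
    """Collect NCCL_* variables and check whether NCCL_TOPO_FILE points to a file."""
    nccl_vars = {k: v for k, v in environ.items() if k.startswith("NCCL_")}
    topo_file = nccl_vars.get("NCCL_TOPO_FILE")
    return {
        "vars": nccl_vars,
        "topo_file": topo_file,
        # False when the variable is unset, empty, or the path does not exist.
        "topo_file_exists": bool(topo_file) and os.path.isfile(topo_file),
    }


report = nccl_env_report()
for name, value in sorted(report["vars"].items()):
    print(f"{name}={value}")
if report["topo_file"] is None:
    print("NCCL_TOPO_FILE is not set")
else:
    print(f"NCCL_TOPO_FILE={report['topo_file']} exists={report['topo_file_exists']}")
```

If the variable turns out to be unset, that may explain why NCCL falls back to reading sysfs for the topology and logs the "could not read ... max_link_speed, ignoring" lines, but that is a guess from the log rather than something I have verified on your setup.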