Hello,
I am training a small NN with one hidden layer using the PyTorch DataParallel approach. Migrating to DistributedDataParallel is my goal, but the following issue is blocking me.
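For context, the training script is essentially the stock nn.DataParallel pattern. The sketch below is a simplified stand-in (the layer sizes and dummy data are placeholders, not my real dataset):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Simplified stand-in for my training script; sizes and data are placeholders.
model = nn.Sequential(
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model = nn.DataParallel(model).cuda()  # replicates the model across all visible GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
loader = DataLoader(data, batch_size=256)

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()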
I have two containers that are somewhat out of my control. One has PyTorch 1.9.1, Python 3.8, and CUDA 11.7; the other has PyTorch 1.13.1, Python 3.9, and CUDA 11.7. I am running these containers on a node with 4 V100 GPUs.
The PyTorch 1.9.1 container runs significantly faster (~50%) than the PyTorch 1.13.1 container, and I am trying to root-cause this. The only difference I see in the logs is the NCCL initialization output, which is present in the fast container and absent in the slow container.
Here’s a sample of the NCCL logs from the fast container.
977b3b0f7db8:14:14 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
977b3b0f7db8:14:14 [0] ofi_init:1134 NCCL WARN NET/OFI Only EFA provider is supported
977b3b0f7db8:14:14 [0] NCCL INFO NET/IB : No device found.
977b3b0f7db8:14:14 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
977b3b0f7db8:14:14 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
977b3b0f7db8:14:77 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
977b3b0f7db8:14:75 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
977b3b0f7db8:14:74 [0] NCCL INFO Channel 00/08 : 0 1 2 3
977b3b0f7db8:14:74 [0] NCCL INFO Channel 01/08 : 0 3 2 1
977b3b0f7db8:14:74 [0] NCCL INFO Channel 02/08 : 0 3 1 2
977b3b0f7db8:14:77 [3] NCCL INFO Trees [0] 2/-1/-1->3->0|0->3->2/-1/-1 [1] 0/-1/-1->3->2|2->3->0/-1/-1 [2] 2/-1/-1->3->0|0->3->2/-1/-1 [3] 0/-1/-1->3->2|2->3->0/-1/-1 [4] 2/-1/-1->3->0|0->3->2/-1/-1 [5] 0/-1/-1->3->2|2->3->0/-1/-1 [6] 2/-1/-1->3->0|0->3->2/-1/-1 [7] 0/-1/-1->3->2|2->3->0/-1/-1
977b3b0f7db8:14:75 [1] NCCL INFO Trees [0] -1/-1/-1->1->2|2->1->-1/-1/-1 [1] 2/-1/-1->1->-1|-1->1->2/-1/-1 [2] -1/-1/-1->1->2|2->1->-1/-1/-1 [3] 2/-1/-1->1->-1|-1->1->2/-1/-1 [4] -1/-1/-1->1->2|2->1->-1/-1/-1 [5] 2/-1/-1->1->-1|-1->1->2/-1/-1 [6] -1/-1/-1->1->2|2->1->-1/-1/-1 [7] 2/-1/-1->1->-1|-1->1->2/-1/-1
[=========REMOVING SOME LOGS FOR BREVITY==========]
977b3b0f7db8:14:76 [2] NCCL INFO comm 0x7f6a4c002ea0 rank 2 nranks 4 cudaDev 2 busId 1d0 - Init COMPLETE
977b3b0f7db8:14:14 [0] NCCL INFO Launch mode Group/CGMD
For the slow container (PyTorch 1.13.1) I cannot get it to display any NCCL logs. I tried setting NCCL_DEBUG to INFO, TRACE, … you name it, with no result. I am under the impression that NCCL is not used at all.
Both containers have the same contents under /etc/nccl.conf. There is no ~/.nccl.conf file on either system.
root@977b3b0f7db8:/# cat /etc/nccl.conf
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=^docker0
I tried setting NCCL_SOCKET_IFNAME=eth0 in the slow container, with no result. Nothing NCCL-related is printed in the logs.
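In case I am simply setting the variable too late, this is roughly the standalone check I am planning to run in the slow container to force a direct NCCL call (my own sketch; I am assuming that setting os.environ before the first NCCL call is enough, and I also export the variables in the shell before launching):

import os

# Set before the first NCCL call; I also export these in the shell.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"

import torch
import torch.cuda.nccl as nccl

# One tensor per GPU; all_reduce sums them in place across devices and
# goes through NCCL directly, so it should emit the usual INFO lines.
tensors = [torch.ones(1024, 1024, device=f"cuda:{i}")
           for i in range(torch.cuda.device_count())]
nccl.all_reduce(tensors)
torch.cuda.synchronize()
print(tensors[0][0, 0].item())  # expect 4.0 on the 4-GPU node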
Can anybody (Patrick?) help me debug this? Assuming that the source of the slow training times is NCCL, how do I make sure that NCCL is being used by the slow container? (And is NCCL used for GPU-to-GPU communication at all when doing DataParallel on a single node?)
BTW, on the slow container, NCCL reports as available:
root@ce5a6da78e30:/# python
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.nccl.version()
(2, 14, 3)
>>> x = torch.rand(1024, 1024, device='cuda:0')
>>> torch.cuda.nccl.is_available([x])
True
>>>
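Since DDP is where I am eventually headed anyway, I was also going to try a minimal single-process check against the nccl backend of torch.distributed, to see whether that produces any NCCL output (my own sketch; the address/port are arbitrary local values):

import os
os.environ["NCCL_DEBUG"] = "INFO"

import torch
import torch.distributed as dist

# Single-rank process group just to force NCCL communicator init;
# the init_method address/port are arbitrary local values.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:29500",
    world_size=1,
    rank=0,
)
t = torch.ones(1, device="cuda:0")
dist.all_reduce(t)  # the first collective triggers NCCL init (and its INFO logs)
print(t.item())
dist.destroy_process_group()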