Unable to see NCCL logs

Hello,

I am training a small NN with one hidden layer using PyTorch's DataParallel approach. Migrating to DistributedDataParallel is my goal, but the following issue is blocking me.
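
For context, the setup is roughly the following (a minimal sketch with toy sizes; the real model, data, and loss are a bit more involved):

import torch
import torch.nn as nn

# toy stand-in for the real one-hidden-layer network
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10),
).cuda()

# replicate the model across all visible GPUs; input batches are scattered along dim 0
model = nn.DataParallel(model)

x = torch.randn(256, 1024, device="cuda")
out = model(x)        # forward runs a replica on every GPU
loss = out.sum()
loss.backward()       # gradients end up on the parameters of the original module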

I have two containers that are somewhat out of my control. One has PyTorch 1.9.1, Python 3.8, and CUDA 11.7, and the other has PyTorch 1.13.1, Python 3.9, and CUDA 11.7. I am running these containers on a node with 4 V100 GPUs.

The PyTorch 1.9.1 container runs significantly faster (~50%) than the PyTorch 1.13.1 container, and I am trying to root-cause this. The only difference I see in the logs is the NCCL initialization output, which is present in the fast container and absent in the slow one.

Here’s a sample of the NCCL logs from the fast container.

977b3b0f7db8:14:14 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>

977b3b0f7db8:14:14 [0] ofi_init:1134 NCCL WARN NET/OFI Only EFA provider is supported
977b3b0f7db8:14:14 [0] NCCL INFO NET/IB : No device found.
977b3b0f7db8:14:14 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
977b3b0f7db8:14:14 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
977b3b0f7db8:14:77 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
977b3b0f7db8:14:75 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
977b3b0f7db8:14:74 [0] NCCL INFO Channel 00/08 :    0   1   2   3
977b3b0f7db8:14:74 [0] NCCL INFO Channel 01/08 :    0   3   2   1
977b3b0f7db8:14:74 [0] NCCL INFO Channel 02/08 :    0   3   1   2
977b3b0f7db8:14:77 [3] NCCL INFO Trees [0] 2/-1/-1->3->0|0->3->2/-1/-1 [1] 0/-1/-1->3->2|2->3->0/-1/-1 [2] 2/-1/-1->3->0|0->3->2/-1/-1 [3] 0/-1/-1->3->2|2->3->0/-1/-1 [4] 2/-1/-1->3->0|0->3->2/-1/-1 [5] 0/-1/-1->3->2|2->3->0/-1/-1 [6] 2/-1/-1->3->0|0->3->2/-1/-1 [7] 0/-1/-1->3->2|2->3->0/-1/-1
977b3b0f7db8:14:75 [1] NCCL INFO Trees [0] -1/-1/-1->1->2|2->1->-1/-1/-1 [1] 2/-1/-1->1->-1|-1->1->2/-1/-1 [2] -1/-1/-1->1->2|2->1->-1/-1/-1 [3] 2/-1/-1->1->-1|-1->1->2/-1/-1 [4] -1/-1/-1->1->2|2->1->-1/-1/-1 [5] 2/-1/-1->1->-1|-1->1->2/-1/-1 [6] -1/-1/-1->1->2|2->1->-1/-1/-1 [7] 2/-1/-1->1->-1|-1->1->2/-1/-1
[=========REMOVING SOME LOGS FOR BREVITY==========]
977b3b0f7db8:14:76 [2] NCCL INFO comm 0x7f6a4c002ea0 rank 2 nranks 4 cudaDev 2 busId 1d0 - Init COMPLETE
977b3b0f7db8:14:14 [0] NCCL INFO Launch mode Group/CGMD

For the slow container (PyTorch 1.13.1), I cannot get it to display any NCCL logs. I tried setting NCCL_DEBUG to INFO, TRACE, … you name it, with no result. I am under the impression that NCCL is not being used at all.

Both containers have the same contents in /etc/nccl.conf. There is no ~/.nccl.conf file on either system.

root@977b3b0f7db8:/# cat /etc/nccl.conf
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=^docker0

I tried setting NCCL_SOCKET_IFNAME=eth0 in the slow container with no result. Nothing NCCL-related is printed in the logs.
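
For clarity, here is a minimal sketch of the kind of thing I mean by setting these variables (the actual launch script is different; as far as I understand, NCCL only reads them when it initializes, so they have to be in the environment before the first collective runs):

# hypothetical launcher snippet, not the real training script
import os

# must be exported before the first CUDA/NCCL initialization in this process
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"    # extra detail on which subsystem logs what
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(16, 16).cuda())   # placeholder model
out = model(torch.randn(8, 16, device="cuda"))      # first forward triggers the parameter broadcast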

Can anybody (Patrick? :wink: ) help me debug this? Assuming that the missing NCCL is the source of the slow training times, how do I make sure that NCCL is being used by the slow container? (And is NCCL used for GPU-to-GPU communication at all when doing DataParallel on a single node?)

BTW, on the slow container, NCCL reports as available.

root@ce5a6da78e30:/# python
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.nccl.version()
(2, 14, 3)
>>> x = torch.rand(1024, 1024, device='cuda:0')
>>> torch.cuda.nccl.is_available([x])
True
>>>

This could be the case, and you could profile your code via e.g. nsys nvprof ... to see if any NCCL kernels are launched at all.
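
If nvprof is inconvenient to run inside the container, torch.profiler should give a similar picture; NCCL kernels (if any are used) would show up as nccl* rows in the summary. A rough sketch with a placeholder model:

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# placeholder model/data; substitute your real DataParallel training step
model = nn.DataParallel(nn.Linear(1024, 1024).cuda())
x = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):                   # a few iterations are enough
        model(x).sum().backward()

# NCCL kernels would appear as "nccl..." entries in this summary
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))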


Thanks a lot Patrick. I profiled both jobs with nvprof and indeed it looks like the slow version is not using any NCCL…

The top of the profiler output from the fast job shows:

GPU activities:   23.05%  28.6632s      2496  11.484ms  11.271ms  11.604ms  ncclBroadcastRingLLKernel_copy_i8(ncclColl)
                   15.70%  19.5256s      9360  2.0861ms  30.848us  7.6814ms  void at::native::_GLOBAL__N__52_tmpxft_0000adfc_00000000_10_Shape_compute_80_cpp1_ii_cedd8df2::CatArrayBatchedCopy<float, unsigned int, int=2, int=128, int=1>(float*, at::native::_GLOBAL__N__52_tmpxft_0000adfc_00000000_10_Shape_compute_80_cpp1_ii_cedd8df2::CatArrInputTensorMetadata<at::native::_GLOBAL__N__52_tmpxft_0000adfc_00000000_10_Shape_compute_80_cpp1_ii_cedd8df2::CatArrayBatchedCopy<float, unsigned int, int=2, int=128, int=1>, unsigned int, int=128, int=1>, at::native::_GLOBAL__N__52_tmpxft_0000adfc_00000000_10_Shape_compute_80_cpp1_ii_cedd8df2::TensorSizeStride<at::native::_GLOBAL__N__52_tmpxft_0000adfc_00000000_10_Shape_compute_80_cpp1_ii_cedd8df2::CatArrInputTensorMetadata, unsigned int=4>, int, at::native::_GLOBAL__N__52_tmpxft_0000adfc_00000000_10_Shape_compute_80_cpp1_ii_cedd8df2::CatArrInputTensorMetadata)
                   10.08%  12.5399s      2496  5.0240ms  2.0200ms  5.0358ms  void at::native::_GLOBAL__N__54_tmpxft_0000ae56_00000000_10_SoftMax_compute_80_cpp1_ii_a3310042::cunn_SoftMaxBackward<int=4, float, float, float, at::native::_GLOBAL__N__54_tmpxft_0000ae56_00000000_10_SoftMax_compute_80_cpp1_ii_a3310042::LogSoftMaxBackwardEpilogue>(float*, float*, float, int)
                    9.39%  11.6809s      2496  4.6799ms  1.8922ms  4.7209ms  volta_sgemm_128x64_tn
                    9.36%  11.6451s      2492  4.6730ms  4.6611ms  4.7133ms  volta_sgemm_128x32_sliced1x4_nt
                    9.08%  11.2902s      2492  4.5306ms  4.4826ms  4.7492ms  volta_sgemm_128x128_nn

While the output from the slow one shows:

GPU activities:   38.33%  58.1016s     16835  3.4512ms  1.7600us  35.691ms  [CUDA memcpy PtoP]
                   13.37%  20.2622s      9347  2.1678ms  32.512us  7.9324ms  void at::native::_GLOBAL__N__9cdb9df5_8_Shape_cu_49f7391c::CatArrayBatchedCopy<float, unsigned int, int=2, int=128, int=1>(float*, at::native::_GLOBAL__N__9cdb9df5_8_Shape_cu_49f7391c::CatArrInputTensorMetadata<at::native::_GLOBAL__N__9cdb9df5_8_Shape_cu_49f7391c::CatArrayBatchedCopy<float, unsigned int, int=2, int=128, int=1>, unsigned int, int=128, int=1>, at::native::_GLOBAL__N__9cdb9df5_8_Shape_cu_49f7391c::TensorSizeStride<at::native::_GLOBAL__N__9cdb9df5_8_Shape_cu_49f7391c::CatArrInputTensorMetadata, unsigned int=4>, int, at::native::_GLOBAL__N__9cdb9df5_8_Shape_cu_49f7391c::CatArrInputTensorMetadata)
                    8.27%  12.5317s      2492  5.0288ms  2.0223ms  5.0411ms  void at::native::_GLOBAL__N__8455d03d_10_SoftMax_cu_9f978f63::cunn_SoftMaxBackward<int=4, float, float, float, at::native::_GLOBAL__N__8455d03d_10_SoftMax_cu_9f978f63::LogSoftMaxBackwardEpilogue>(float*, float*, float, int)
                    7.71%  11.6943s      2494  4.6890ms  1.8946ms  4.7172ms  volta_sgemm_128x64_tn
                    7.67%  11.6229s      2488  4.6716ms  4.6533ms  4.6945ms  volta_sgemm_128x32_sliced1x4_nt

Any insights? How would I “force” NCCL on the slow version, do you know?

(Since this is a DataParallel job, I don't think I really have a way of setting a backend, unlike with DistributedDataParallel.)
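
For reference, this is roughly what the eventual DDP version would look like, where the backend is requested explicitly (a sketch, assuming a single-node torchrun launch; names are placeholders):

# sketch of the planned DDP version, launched e.g. with:
#   torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")       # NCCL backend chosen explicitly here
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])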

Edit: Another interesting thing I see is that the slow job seems to spend a lot of time (38.33%) copying data from GPU to GPU (memcpy PtoP), while the fast one spends only 2% of its time doing that, and the majority of its time is spent in ncclBroadcastRingLLKernel_copy_i8. It does kind of seem like there is no NCCL in the slow scenario.

I don't know exactly how the current nn.DataParallel implementation works, as it has been deprecated for some time now, and we generally recommend using DistributedDataParallel for better performance.

This is expected in DataParallel, since each forward pass will copy the entire model to each device replica, while DistributedDataParallel only does this during its initial setup.

Thanks a lot Patrick. To clarify, both scenarios are running DataParallel, but one uses NCCL while the more recent container does not, for some reason. Exact same code base; only the environments differ. There is also nothing obvious when comparing the two Dockerfiles: both install NCCL, and in both, torch.cuda.nccl.is_available() returns True.

Interestingly, running the same job on the official PyTorch container pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel, I don't see the code using NCCL either. (I had to downgrade PyTorch to 1.9.1 manually because the official Docker image comes with 1.10.0 even though it's tagged with 1.9.1.) So with the same PyTorch version and the same CUDA, the fast version is using NCCL somehow, while the slow version (and the official one) doesn't. It really appears to be something in the environment that triggers PyTorch to use NCCL in the fast scenario.

Agreed on DistributedDataParallel; this whole effort is actually about migrating to DDP. But in order to do so, I need to upgrade the base container, and when I upgrade I take a massive hit in training speed. So it becomes a harder sell to migrate to DDP in prod, because my baseline is now much slower :slightly_smiling_face:

Anything else you can think of that I should dig into? Are you saying that it's likely somewhere in the PyTorch code base where they switch from NCCL to memcpy? (But then why would the official 1.9.1 release also use memcpy?)

This might be the case, but I also don't know what (if anything) changed in nn.DataParallel, since I'm not using it and haven't checked what exactly this code does, given its deprecation.

I think the valid argument to switch to DDP is that it’s our recommendation for better perf., is heavily used, and is not deprecated.

I can only speculate, but maybe internal dispatching is taking a different path depending on the environment setup. However, as DataParallel is deprecated, the interest in digging into the code base and debugging the issue would be quite low, as a fix most likely wouldn't even be accepted.

Understood. Thanks a lot for your input Patrick!

Sure, and please let me know if you get stuck using DDP or see any kind of functionality or performance issues!

Sergey - This looks like an NCCL network plugin selection issue to me.

977b3b0f7db8:14:14 [0] ofi_init:1134 NCCL WARN NET/OFI Only EFA provider is supported
977b3b0f7db8:14:14 [0] NCCL INFO NET/IB : No device found.

What kind of cluster are you running this on? It seems to me like the AWS libfabric plugin is loaded and the IB verbs provider is selected explicitly, but you are running on a cluster without either of those? Your environment might be polluted with incorrect/stale NCCL_* variables.
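
A quick way to check is to dump anything NCCL- or libfabric-related from the environment of the training process, e.g. something like:

import os

# print any variables that could influence NCCL or the aws-ofi-nccl plugin
for key in sorted(os.environ):
    if key.startswith(("NCCL_", "FI_")):
        print(key, "=", os.environ[key])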

Thanks Raghu,

I am running these jobs on a single p3.16xlarge instance with 8 V100 GPUs. My understanding is that the WARN above comes from the aws-ofi-nccl plugin, which is not supported on the instance type I am running my jobs on.

BTW, I modified the embedding array to be dense (it was sparse before), and now I do see NCCL logs in both scenarios, but the slow version is still more than 2x slower.
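
(Concretely, assuming the embedding is a standard nn.Embedding, the change was essentially flipping its sparse flag, something like:)

import torch.nn as nn

num_embeddings, embedding_dim = 100_000, 128   # placeholder sizes

emb_sparse = nn.Embedding(num_embeddings, embedding_dim, sparse=True)    # before: sparse gradients
emb_dense = nn.Embedding(num_embeddings, embedding_dim, sparse=False)    # after: dense gradients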

The warning message in your post is present in both scenarios. Here’s the relevant bit for the fast version.

[2023-04-04 20:37:17.365 0b13273ea9b4:15 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-04-04 20:37:17.426 0b13273ea9b4:15 INFO profiler_config_parser.py:102] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
0b13273ea9b4:15:15 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.11<0>

0b13273ea9b4:15:15 [0] ofi_init:1134 NCCL WARN NET/OFI Only EFA provider is supported
0b13273ea9b4:15:15 [0] NCCL INFO NET/IB : No device found.
0b13273ea9b4:15:15 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.11<0>
0b13273ea9b4:15:15 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

And here's the relevant bit from the slow version.

0363c6547f6e:301:362 [0] NCCL INFO cudaDriverVersion 12000
0363c6547f6e:301:362 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
0363c6547f6e:301:362 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
0363c6547f6e:301:362 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
NCCL version 2.14.3+cuda11.7
0363c6547f6e:301:367 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
0363c6547f6e:301:367 [0] 9.395920 get_platform_type:1112 NCCL TRACE NET/OFI Read 8 bytes. EC2 platform type is HVM domU
0363c6547f6e:301:367 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
0363c6547f6e:301:367 [0] 13.970224 find_ofi_provider:597 NCCL TRACE NET/OFI Could not find any optimal provider supporting GPUDirect RDMA

0363c6547f6e:301:367 [0] ofi_init:1304 NCCL WARN NET/OFI Only EFA provider is supported

0363c6547f6e:301:367 [0] ofi_init:1355 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
0363c6547f6e:301:367 [0] NCCL INFO NET/IB : No device found.
0363c6547f6e:301:367 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.11<0>
0363c6547f6e:301:367 [0] NCCL INFO Using network Socket
0363c6547f6e:301:370 [3] NCCL INFO Using network Socket
0363c6547f6e:301:368 [1] NCCL INFO Using network Socket
0363c6547f6e:301:369 [2] NCCL INFO Using network Socket

Anything useful in the logs here? Both setups have the same /etc/nccl.conf file with the following

NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=^docker0

(I exported NCCL_DEBUG=TRACE to produce the snippets above.) I also inspected the environment variables in both scenarios, and nothing NCCL-related jumped out at me. Any ideas which NCCL env variables I should explore?