DistributedDataParallel on multiple GPU nodes slower than one GPU node

Hi, I am using PyTorch DistributedDataParallel to train some models and, surprisingly, training becomes slower when I move from one GPU node to two GPU nodes.

Here is the example i am running: https://github.com/huggingface/transformers/blob/v2.3.0/examples/run_lm_finetuning.py

Here is my infrastructure:

Single Node: p3.16xlarge
Training time: 36 mins

Two Nodes: p3.16xlarge
Training time: 1 h 45 mins

In both cases, I am using PyTorch DistributedDataParallel, and GPU utilization is almost always 100%.
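Roughly, the per-node setup looks like the sketch below (simplified, not the exact code from run_lm_finetuning.py; the helper names are just placeholders). Each node runs 8 processes, one per GPU, started with `python -m torch.distributed.launch`.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(local_rank):
    # One process per GPU; rank, world size, and master address are read from
    # the environment variables set by the launcher (init_method="env://").
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

def wrap(model, local_rank):
    model = model.to(local_rank)
    # Gradients are all-reduced across all processes on every backward pass.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```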

I have enabled NCCL_DEBUG=INFO and copied the NCCL output from the single-node and the multi-node training runs in the link below.
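For completeness, this is roughly how the logging is enabled; the variables have to be set in every worker's environment before the process group is initialized (the subsystem filter is optional).

```python
import os

# Must be set before torch.distributed.init_process_group / NCCL initialization.
os.environ["NCCL_DEBUG"] = "INFO"
# Optional: restrict the output to specific NCCL subsystems, e.g. INIT and GRAPH.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"
```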

It seems that in single-node mode NCCL creates a lot of CHANNELs, but not in multi-node mode. Could the slowdown be due to networking issues between the GPU nodes?

Any help is appreciated.

Two questions:

  1. Did you divide the epoch size on each process by world_size?
  2. Will there be any contention on the data loader?

cc @osalpekar

Also cc @zhangguanheng66 for transformer questions

We are tracking this issue here: https://github.com/NVIDIA/nccl/issues/318

It looks like the bottleneck is the network bandwidth between the two GPU nodes. This example fine-tunes the roberta-base model, which is about 500 MB. In the two-node case, the amount of gradient data exchanged per synchronization would be around 8 GB. The NCCL folks mentioned that on a single instance this communication goes over NVLink at roughly 120 GB/s, whereas over TCP/IP sockets, even with 100 GB/s of network bandwidth, the effective all-reduce bandwidth would be about 10 GB/s. That is an order of magnitude smaller, so communication time becomes the bottleneck.
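A crude back-of-envelope with just the figures quoted above (not measurements, and ignoring any overlap of communication with the backward pass) shows why that order of magnitude matters:

```python
# Rough estimate using only the figures quoted above; not a measurement.
sync_bytes = 8e9      # ~8 GB of gradient data exchanged per synchronization
nvlink_bw  = 120e9    # ~120 GB/s NVLink bandwidth within a single node
socket_bw  = 10e9     # ~10 GB/s effective all-reduce bandwidth over TCP sockets

print(f"NVLink-class link: ~{sync_bytes / nvlink_bw:.2f} s per sync")  # ~0.07 s
print(f"TCP sockets:       ~{sync_bytes / socket_bw:.2f} s per sync")  # ~0.80 s
```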

In this example, the number of epochs is set to 1 for both the single-node and the two-node case.
The training script uses DistributedSampler, so the data partitions across processes should be mutually exclusive.
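For reference, the relevant pattern looks roughly like this (simplified; `train_dataset`, `per_gpu_batch_size`, and `num_train_epochs` are placeholders):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process gets a disjoint 1/world_size shard of the dataset, so the
# per-process epoch is effectively already divided by world_size.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset,
                          batch_size=per_gpu_batch_size,
                          sampler=train_sampler,
                          num_workers=4,
                          pin_memory=True)

for epoch in range(num_train_epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the shard assignment each epoch
    for batch in train_loader:
        ...
```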

If you feel there is anything other than network bandwidth that could be causing issues, let me know.