DistributedDataParallel on multiple GPU nodes slower than one GPU node

Hi, I am using PyTorch DistributedDataParallel to train some models and, surprisingly, training becomes slower when I move from one GPU node to two GPU nodes.

Here is the example i am running: https://github.com/huggingface/transformers/blob/v2.3.0/examples/run_lm_finetuning.py

Here is my infrastructure:

Single Node: p3.16xlarge
Training time: 36 mins

Two Nodes: p3.16xlarge
Training time: 1 h 45 mins

In both cases, I am using PyTorch DistributedDataParallel, and GPU utilization is almost always 100%.
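Roughly, the per-node setup looks like the sketch below (simplified, not the exact code from run_lm_finetuning.py; the helper names are just placeholders). Each node runs 8 processes, one per GPU, started with `python -m torch.distributed.launch`.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(local_rank):
    # One process per GPU; rank, world size, and master address are read from
    # the environment variables set by the launcher (init_method="env://").
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

def wrap(model, local_rank):
    model = model.to(local_rank)
    # Gradients are all-reduced across all processes on every backward pass.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```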

I have enabled NCCL_DEBUG=INFO and copied the NCCL output from the single-node and the multi-node training runs in the link below.
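For completeness, this is roughly how the logging is enabled; the variables have to be set in every worker's environment before the process group is initialized (the subsystem filter is optional).

```python
import os

# Must be set before torch.distributed.init_process_group / NCCL initialization.
os.environ["NCCL_DEBUG"] = "INFO"
# Optional: restrict the output to specific NCCL subsystems, e.g. INIT and GRAPH.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"
```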

It seems that in single-node mode NCCL creates a lot of CHANNELs, but not in multi-node mode. Could the slowdown be due to networking issues between the GPU nodes?

Any help is appreciated.

Two questions:

  1. Did you divide the epoch size on each process by world_size?
  2. Will there be any contention on the data loader?

cc @osalpekar

Also cc @zhangguanheng66 for transformer questions

We are tracking this issue here: https://github.com/NVIDIA/nccl/issues/318

It looks like the bottleneck is the network bandwidth between the two GPU nodes. This example fine-tunes the roberta-base model, which is about 500 MB. In the two-node case, the amount of gradient data exchanged per synchronization would be around 8 GB. The NCCL folks mentioned that on a single instance this communication goes over NVLink at roughly 120 GB/s, whereas over TCP/IP sockets, even with 100 GB/s of network bandwidth, the effective all-reduce bandwidth would be about 10 GB/s. That is an order of magnitude smaller, so communication time becomes the bottleneck.
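A crude back-of-envelope with just the figures quoted above (not measurements, and ignoring any overlap of communication with the backward pass) shows why that order of magnitude matters:

```python
# Rough estimate using only the figures quoted above; not a measurement.
sync_bytes = 8e9      # ~8 GB of gradient data exchanged per synchronization
nvlink_bw  = 120e9    # ~120 GB/s NVLink bandwidth within a single node
socket_bw  = 10e9     # ~10 GB/s effective all-reduce bandwidth over TCP sockets

print(f"NVLink-class link: ~{sync_bytes / nvlink_bw:.2f} s per sync")  # ~0.07 s
print(f"TCP sockets:       ~{sync_bytes / socket_bw:.2f} s per sync")  # ~0.80 s
```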

In this example, the number of epochs is set to 1 for both the single-node and the two-node case.
The training script uses DistributedSampler, so the data partitions across processes should be mutually exclusive.
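For reference, the relevant pattern looks roughly like this (simplified; `train_dataset`, `per_gpu_batch_size`, and `num_train_epochs` are placeholders):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process gets a disjoint 1/world_size shard of the dataset, so the
# per-process epoch is effectively already divided by world_size.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset,
                          batch_size=per_gpu_batch_size,
                          sampler=train_sampler,
                          num_workers=4,
                          pin_memory=True)

for epoch in range(num_train_epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the shard assignment each epoch
    for batch in train_loader:
        ...
```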

If you feel there is anything other than network bandwidth that could be causing issues, let me know.