2 Docker containers slower than 1 with DistributedDataParallel

Hi,

I’m struggling with a case where Docker containerization seems to negate the speedup of distributed training:

  • GPU training with nn.parallel.DistributedDataParallel
  • 2 processes with 1 GPU each are about 2x faster than 1 process with 1 GPU when run directly on a Google Compute Engine n1-standard-16 instance.
  • The same 2 processes, each in its own Docker container with one GPU on Google Kubernetes Engine, are slower than 1 process with 1 GPU, whether that single process runs in a container or not. Both containers are again on a single n1-standard-16 machine.
  • Per-process batch size is always 3. I measure speed as the time taken to accumulate gradients to an equivalent batch size of 24, which takes 4 iterations with 2 GPUs or 8 with one (a minimal sketch of this measurement is below this list).
  • (In case it’s relevant) I’m using AMP with opt level O1.
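
For reference, a minimal sketch of how I measure this, with a toy model and random data standing in for my real model, data loader, and AMP setup (process group initialization is shown further down):

    import os
    import time

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Assumes init_process_group has already been called and that LOCAL_RANK
    # is set per process; the linear layer and random inputs are placeholders.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = DistributedDataParallel(torch.nn.Linear(1024, 1024).cuda(),
                                    device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    per_process_batch = 3
    target_batch = 24
    # 4 accumulation steps with 2 GPUs, 8 with one
    accum_steps = target_batch // (per_process_batch * dist.get_world_size())

    start = time.time()
    optimizer.zero_grad()
    for _ in range(accum_steps):
        inputs = torch.randn(per_process_batch, 1024, device="cuda")
        loss = model(inputs).mean()
        loss.backward()  # DDP all-reduces gradients during backward
    optimizer.step()
    torch.cuda.synchronize()
    print("rank {}: effective batch of {} took {:.3f}s".format(
        dist.get_rank(), target_batch, time.time() - start))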

Could this be slow communication due to containerization, or a failure to use direct GPU-to-GPU communication?

Containers:

  • Image based on pytorch/pytorch:1.3-cuda10.1-cudnn7-devel
  • Request 24 GB memory
  • Seem to use more CPU than raw processes
  • Use the host network and IPC namespace so that the init_process_group TCP initialization works (is this the best way?); a sketch of that initialization follows this list.
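
For completeness, each container initializes the process group roughly like this (MASTER_ADDR/MASTER_PORT point at the host-network address of rank 0; the environment-variable names are just how I pass things in, nothing special):

    import os

    import torch
    import torch.distributed as dist

    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(
        backend="nccl",  # I also tried "gloo"
        init_method="tcp://{}:{}".format(os.environ["MASTER_ADDR"],
                                         os.environ["MASTER_PORT"]),
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(0)  # one GPU per container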

I tried:

  • NCCL
  • Gloo (a bit slower than NCCL)
  • Putting the containers on 2 separate machines (quite a bit slower)

NCCL initialization logs:

NCCL INFO Bootstrap : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL INFO NET/Socket : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL version 2.4.8+cuda10.1
NCCL INFO Bootstrap : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL INFO NET/Socket : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL INFO Setting affinity for GPU 0 to ffff
NCCL INFO Setting affinity for GPU 0 to ffff
NCCL INFO Could not find real path of /sys/class/net/cbr0/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO Could not find real path of /sys/class/net/vetha22648b8/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB SYS SYS
NCCL INFO Could not find real path of /sys/class/net/cbr0/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO Could not find real path of /sys/class/net/vetha22648b8/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO CUDA Dev 0[1], Socket NIC distance :  PHB SYS SYS
NCCL INFO Channel 00 :    0   1
NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
NCCL INFO comm 0x7f478c0019e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
NCCL INFO Launch mode Parallel
NCCL INFO comm 0x7f82300019e0 rank 1 nranks 2 cudaDev 0 nvmlDev 1 - Init COMPLETE

Any help would be greatly appreciated!

Hi, since you mentioned that 2 processes with 1 GPU each attain the expected speedup when run directly on the Compute Engine instance, it leads me to think that there may be a configuration difference between that environment and the Docker one. Could you try the following and see if it helps?

Setting export OMP_NUM_THREADS=1 as explained in https://github.com/pytorch/pytorch/issues/22451
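
If it is easier to test from inside your training script than in the container entrypoint, here is a quick sketch (assuming nothing in the process imports torch earlier):

    import os

    # Must be set before torch is imported so OpenMP picks it up; roughly
    # equivalent to `export OMP_NUM_THREADS=1` in the container entrypoint.
    os.environ["OMP_NUM_THREADS"] = "1"

    import torch

    # Limiting intra-op CPU threads after import is also possible:
    torch.set_num_threads(1)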

For your hypothesis about slower GPU-to-GPU communication, it may be worthwhile to instrument the training loop and determine which parts (initialization, data loading, forward, backward, etc.) are slower on the Docker instances than on Compute Engine. One way to do this is the torch.cuda.Event.elapsed_time() API (https://pytorch.org/docs/stable/_modules/torch/cuda/streams.html#Event.elapsed_time), which times GPU computations (an example is available here: https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics).
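
A minimal sketch of that kind of instrumentation (model, inputs, criterion, and target below are placeholders for your own training step):

    import torch

    # CUDA events time work on the GPU itself; elapsed_time() is only valid
    # after the end event has completed, hence the synchronize().
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    output = model(inputs)            # placeholder: your forward pass
    loss = criterion(output, target)  # placeholder: your loss
    loss.backward()                   # placeholder: backward (incl. DDP all-reduce)
    end.record()

    torch.cuda.synchronize()  # wait for the recorded GPU work to finish
    print("forward+backward: {:.1f} ms".format(start.elapsed_time(end)))

Placing separate event pairs around the forward pass, backward pass, and optimizer.step(), plus plain time.time() around data loading, should show which stage blows up in the containerized run.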