Hi,
I’m struggling with an issue where Docker containerization seems to negate the speedup of distributed training:
- GPU training with nn.parallel.DistributedDataParallel
- 2 processes with 1 GPU each are about 2x faster than 1 process with 1 GPU when run directly on a Google Compute Engine n1-standard-16 instance.
- The same 2 processes, each in its own Docker container with one GPU, are slower on Google Kubernetes Engine than 1 process with 1 GPU, regardless of whether that single process runs in a container or not. Both containers are again on a single n1-standard-16 machine.
- The per-process batch size is always 3. I measure speed as the time taken to accumulate gradients up to an effective batch size of 24, which takes 4 iterations with 2 GPUs or 8 with one (rough sketch of the loop below).
- (If it’s relevant) I’m using AMP with opt level O1.
Is the communication slow because of containerization? Or is NCCL failing to use direct GPU-to-GPU communication?
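For reference, this is roughly how the measurement loop looks. It’s a heavily simplified sketch: the Linear model and random data are placeholders, and the rendezvous address/port here are made up (the IP is just the eth0 address from the logs below), but the AMP + DDP + accumulation structure is the same:

```python
import os, time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from apex import amp

rank = int(os.environ["RANK"])                        # 0 or 1, set per process/container
world_size = int(os.environ["WORLD_SIZE"])            # 1 or 2
torch.cuda.set_device(rank % torch.cuda.device_count())
dist.init_process_group(backend="nccl",               # "gloo" in the Gloo runs
                        init_method="tcp://10.128.0.72:23456",  # TCP init; port made up here
                        world_size=world_size, rank=rank)

model = torch.nn.Linear(1024, 1024).cuda()            # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DDP(model, device_ids=[torch.cuda.current_device()])

batch_size = 3                                        # per-process batch size
accum_steps = 24 // (batch_size * world_size)         # 4 iterations with 2 GPUs, 8 with 1

optimizer.zero_grad()
start = time.time()
for i in range(1000):
    x = torch.randn(batch_size, 1024, device="cuda")  # placeholder for the real data
    loss = model(x).mean()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                        # gradients are all-reduced here
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()
        print(f"rank {rank}: effective batch of 24 took {time.time() - start:.3f}s")
        start = time.time()
```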
Containers:
- Image based on pytorch/pytorch:1.3-cuda10.1-cudnn7-devel
- Request 24 GB memory
- Seem to use more CPU than raw processes
- Use the host network and IPC namespace, since otherwise the init_process_group TCP initialization doesn’t work (is this the best way? The exact call is spelled out below.)
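Concretely, the initialization is the same TCP rendezvous as in the sketch above, pointed at the node’s eth0 address (again, the port here is made up):

```python
import os
import torch.distributed as dist

# 10.128.0.72 is the node's eth0 address (see the NCCL logs below); the port is arbitrary.
# Without host network/IPC on the containers, this initialization didn't work for me.
dist.init_process_group(backend="nccl",               # or "gloo"
                        init_method="tcp://10.128.0.72:23456",
                        world_size=2,
                        rank=int(os.environ["RANK"]))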
I tried:
- NCCL
- Gloo (a bit slower than NCCL)
- Putting the containers on 2 separate machines (quite a bit slower)
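The logs below were captured with NCCL debug logging enabled in each container, i.e. roughly:

```python
import os
# Set before init_process_group so NCCL prints the initialization/topology info pasted below.
os.environ["NCCL_DEBUG"] = "INFO"
```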
NCCL initialization logs:
NCCL INFO Bootstrap : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL INFO NET/Socket : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL version 2.4.8+cuda10.1
NCCL INFO Bootstrap : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL INFO NET/Socket : Using [0]eth0:10.128.0.72<0> [1]cbr0:10.44.78.1<0> [2]vetha22648b8:fe80::286b:3cff:fef3:6eea%vetha22648b8<0>
NCCL INFO Setting affinity for GPU 0 to ffff
NCCL INFO Setting affinity for GPU 0 to ffff
NCCL INFO Could not find real path of /sys/class/net/cbr0/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO Could not find real path of /sys/class/net/vetha22648b8/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB SYS SYS
NCCL INFO Could not find real path of /sys/class/net/cbr0/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO Could not find real path of /sys/class/net/vetha22648b8/device
NCCL INFO include/net.h:19 -> 2
NCCL INFO CUDA Dev 0[1], Socket NIC distance : PHB SYS SYS
NCCL INFO Channel 00 : 0 1
NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
NCCL INFO comm 0x7f478c0019e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
NCCL INFO Launch mode Parallel
NCCL INFO comm 0x7f82300019e0 rank 1 nranks 2 cudaDev 0 nvmlDev 1 - Init COMPLETE
Any help would be greatly appreciated!