This was my issue:
Here’s a screenshot of distributed training in Pytorch when I call the train function like:
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nproc_per_node=4 train_new.py. You can see that the first rank has also initted 3 separate processes for each other GPU. When I use 10 GPUs on a box this severely limits the batch size, since the 0th dimension node has so much less capacity. What is it storing? I thought gradients in DDP were all-reduced. I’ve also tried turning broadcast_buffers to False to no avail.
Model is stacked modules of 1D-conv, relu, batch norm, LSTM, followed by a large softmax layer and CTC loss. Backend is NCCL
Pytorch 1.3.0, Cuda 10.1, Titan RTX, Ubuntu 18.04. Can provide more code upon request.