Network parameter sync in forward pass

Hi all,

I’m trying to use the distributed package for multi-GPU training. Because of the way the code is written, the master process does all the initialisation (creating model replicas, optimisers, etc.). From the PyTorch source code, it seems like during the forward pass, all model replicas will be synced with the one running in the subprocess with rank 0. Does that mean I could just initialise one optimiser for subprocess 0 and only update the parameters of the first model replica?

Thanks,

Hi @DzReal

From the PyTorch source code, it seems like during the forward pass, all model replicas will be synced with the one running in the subprocess with rank 0.

If you are using DistributedDataParallel, the above is actually not true. The distributed sync occurs during the backward pass, and it averages gradients rather than parameters (see torch/csrc/distributed/c10d/reducer.cpp). So by the time the optimizer consumes those gradients, they are already globally averaged gradients.
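For concreteness, here is a minimal sketch that checks this behaviour (the toy model, the manual seeding, and the gloo/torchrun launch are illustrative assumptions, not the reducer itself): after loss.backward() under DDP, param.grad is already identical on every rank, so each rank's local optimizer would step with the same averaged gradients.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def demo():
    # Launch with: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group("gloo")          # or "nccl" when using GPUs
    rank = dist.get_rank()
    torch.manual_seed(rank)                  # make sure each rank sees different data

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)                   # broadcasts rank 0's parameters at construction

    # Each rank feeds different inputs ...
    x = torch.randn(4, 10) * (rank + 1)
    loss = ddp_model(x).sum()
    loss.backward()                          # gradients are all-reduced (averaged) here

    # ... yet the resulting gradients match on every rank.
    grad = ddp_model.module.weight.grad.clone()
    gathered = [torch.zeros_like(grad) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, grad)
    if rank == 0:
        print(all(torch.allclose(g, gathered[0]) for g in gathered))  # True

    dist.destroy_process_group()

if __name__ == "__main__":
    demo()
```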

The sync you mentioned in the forward pass might be this. That one only does intra-node sync, and only when you use a single DDP process to work on multiple GPUs.

Does that mean I could just initialise one optimiser for subprocess 0 and only update the parameters of the first model replica?

No. Each DDP process should have its own local optimizer.

So when running on a single node with multiple GPUs, could I use just one optimiser?

So when running on a single node with multiple GPUs, could I use just one optimiser?

You will need one optimizer per DDP process, regardless of where those DDP processes run. I hope the following note helps explain it: Distributed Data Parallel — PyTorch master documentation
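In case an end-to-end example helps, below is a hedged sketch of that pattern (the toy model, learning rate, and mp.spawn launch are illustrative, not taken from the thread): one process per GPU, each wrapping its own replica in DDP and constructing its own local optimizer.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)  # one optimizer per process

    for _ in range(5):
        optimizer.zero_grad()
        x = torch.randn(32, 10, device=rank)
        loss = ddp_model(x).sum()
        loss.backward()      # gradients averaged across all processes here
        optimizer.step()     # every rank applies the same update locally

    dist.destroy_process_group()

if __name__ == "__main__":
    # Assumes a machine with at least 2 GPUs.
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Because DDP broadcasts rank 0's parameters when it is constructed and every rank then applies the same averaged gradients, the local optimizers never diverge even though no optimizer state is shared between processes.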