Does DistributedDataParallel calculate the average gradient across each GPU or each node?

No matter which article, GitHub issue, or PyTorch/StackOverflow question I read, I still can't seem to get a clear-cut answer on this.

Suppose I have the following scenario:

Scenario: 1 node, 4 GPUs, batch size M

According to this documentation, DDP will create 4 processes, each with its own dataloader and model. In most official PyTorch implementations of DDP (e.g. the ImageNet models), common practice is to divide the batch size by the number of GPUs per node, so each dataloader (each process) will take batches of size M/4. Further on, the article explains the following:

When gradients in one bucket are all ready, the Reducer kicks off an asynchronous allreduce on that bucket to calculate mean of gradients across all processes. When all buckets are ready, the Reducer will block waiting for all allreduce operations to finish. When this is done, averaged gradients are written to the param.grad field of all parameters.
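For reference, this is the kind of per-process setup I mean. It's only a sketch of the common pattern; the toy model, dataset, and the value of M are placeholders I made up, not anything from the article:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

M = 256                                    # global batch size for the scenario (made-up value)

dist.init_process_group("nccl")            # launched with one process per GPU (e.g. via torchrun)
rank = dist.get_rank()
world_size = dist.get_world_size()         # 4 in this scenario
torch.cuda.set_device(rank)

# Toy model and dataset, only to keep the sketch self-contained.
model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
dataset = TensorDataset(torch.randn(10_000, 10), torch.randn(10_000, 1))

per_gpu_batch = M // world_size            # M/4 per process, as in the ImageNet example
sampler = DistributedSampler(dataset)      # each process sees a different shard of the data
loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)
```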

Question 1: What is a bucket here? Is it a collection of training examples (e.g. one per process, or one per GPU)?
Question 2: Since we take the average gradient across processes that each have a batch size of M/4, isn't this incorrect, given that we actually want the average gradient over a batch of size M?
Question 3: In a multi-node setting, to be consistent with the above explanation, shouldn't we divide by the total number of GPUs rather than by the GPUs per node? E.g. for 2 nodes with 4 GPUs each, shouldn't the per-GPU batch size be M/8?
Question 4: In general, for N nodes with M GPUs each, how should we divide a batch of size B across the nodes and GPUs if we want our training to reflect the results of mini-batch GD with batch size B?

  1. A bucket is DDP’s notion of a collection of parameter gradients, so a bucket can essentially contain 1 or more parameter grads.
  2. Yeah, if the loss is summed across instances, the gradient will be 4 times smaller (see the sketch after this list). Please see the note here: pytorch/distributed.py at master · pytorch/pytorch · GitHub
  3. We divide by the number of training processes, so if we have 2 nodes and 4 GPUs per node, we indeed divide by 8.
  4. In this case, you have M * N workers. Given a global batch size B, you can have a batch of B/(M * N) on each node; if the loss is summed instead of averaged, you may have to scale the gradients by the world size.
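To make 2. concrete, here is a toy single-process sketch of what the cross-process averaging amounts to. It is not DDP internals; the quadratic loss, the sizes, and the simulation of 4 processes by chunking one tensor are made up purely for illustration:

```python
import torch

torch.manual_seed(0)
M, world_size = 8, 4
x = torch.randn(M, 3)
w = torch.randn(3, requires_grad=True)

# "Reference" gradient: mean loss over the full batch of M samples.
full_loss = (x @ w).pow(2).mean()
full_grad, = torch.autograd.grad(full_loss, w)

# What DDP effectively does: each process computes its local gradient,
# then the gradients are averaged across processes (all-reduce with mean).
local_grads = []
for chunk in x.chunk(world_size):
    local_loss = (chunk @ w).pow(2).mean()   # per-process loss is a *mean* over M/4 samples
    g, = torch.autograd.grad(local_loss, w)
    local_grads.append(g)
ddp_grad = torch.stack(local_grads).mean(0)

print(torch.allclose(full_grad, ddp_grad))   # True: averaging per-process means recovers the full-batch mean

# If each per-process loss were a *sum* instead of a mean, the averaged gradient
# would come out world_size times smaller than the gradient of the summed loss
# over the full batch, so you would scale it back up by world_size.
```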

In this case, you have M * N workers. Given a global batch size B, you can have a batch of B/(M * N) on each node; if the loss is summed instead of averaged, you may have to scale the gradients by the world size.

Why each node? If we have batch size B, M GPUs per node, and N nodes, shouldn't we divide it as B/(M * N) on each GPU? That is, each of the M * N GPUs will take in a batch of size B/(M * N).
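In other words, what I have in mind is just (made-up numbers):

```python
N, M, B = 2, 4, 256               # nodes, GPUs per node, global batch size (made-up numbers)
world_size = M * N                # one DDP process per GPU -> 8 workers
per_gpu_batch = B // world_size   # B/(M * N) = 32 samples on each GPU, not on each node
```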