No matter which article, GitHub issue, or PyTorch/StackOverflow question I read, I can't seem to get a clear-cut answer on this.
Suppose I have the following scenario:
Scenario: 1 node, 4 GPUs, batch size M
According to this documentation, DDP will create 4 processes, each with its own dataloader and model replica. In most official PyTorch DDP examples (e.g., the ImageNet models), common practice is to divide the batch size by the number of GPUs per node, so each dataloader (each process) takes batches of size M/4. Further on, the article explains the following:
> When gradients in one bucket are all ready, the Reducer kicks off an asynchronous allreduce on that bucket to calculate the mean of gradients across all processes. When all buckets are ready, the Reducer will block waiting for all allreduce operations to finish. When this is done, averaged gradients are written to the param.grad field of all parameters.
Question 1: What is a bucket here? Is it a collection of training examples (e.g., one per process, or one per GPU)?
Question 2: Since we take the average gradient across processes that each use a batch size of M/4, isn't this incorrect, given that we actually want the average gradient over a batch of size M?
Question 3: In a multi-node setting, to be consistent with the above explanation, shouldn't we divide by the total number of GPUs rather than by the GPUs per node? E.g., for 2 nodes with 4 GPUs each, shouldn't we use M/8?
Question 4: In general, for N nodes with G GPUs each, how should we divide the batch size M across the nodes and GPUs if we want our training to reflect the results of mini-batch GD with batch size M?
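For concreteness, here is a small sketch of the division scheme Questions 3 and 4 are asking about: divide the global batch size M by the total number of processes (nodes × GPUs per node), since DDP launches one process per GPU. The function name is hypothetical, and I'm not claiming this scheme is the correct one; it's just the convention I'm trying to confirm.

```python
def per_process_batch_size(global_batch_size: int,
                           num_nodes: int,
                           gpus_per_node: int) -> int:
    """Hypothetical helper: split a global batch M evenly across all
    DDP processes (one process per GPU on every node)."""
    world_size = num_nodes * gpus_per_node
    assert global_batch_size % world_size == 0, \
        "global batch size must be divisible by the world size"
    return global_batch_size // world_size

# 1 node, 4 GPUs, M = 256 -> each process gets M/4 = 64
print(per_process_batch_size(256, num_nodes=1, gpus_per_node=4))
# 2 nodes, 4 GPUs each, M = 256 -> each process gets M/8 = 32
print(per_process_batch_size(256, num_nodes=2, gpus_per_node=4))
```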