DistributedDataParallel training not efficient

Very interesting project!

So basically, training with 4 GPUs needs 4 epochs to reach the same results that a single GPU achieves in only 1 epoch.

That should not be the case if the gradient sync among the 4 GPUs is actually happening; one epoch on 4 GPUs should be equivalent to running 4 epochs on a single GPU, not the other way around.

Can you confirm whether there is any communication between the different processes, e.g. by printing the gradient values on different ranks after backward()? If DDP is syncing correctly, the gradients on every rank should be identical after backward().
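
For reference, a quick way to run that check is to print a per-rank gradient fingerprint right after `loss.backward()`. This is a minimal sketch assuming a standard PyTorch DDP loop; `model` and the process-group setup are placeholders for whatever the repo actually uses.

```python
# Minimal sketch of the suggested check, assuming torch.distributed is already
# initialized and `model` is wrapped in DistributedDataParallel.
import torch
import torch.distributed as dist

def check_grad_sync(model: torch.nn.Module) -> None:
    """Print a gradient fingerprint on each rank; the values should match
    across ranks after loss.backward() if DDP is communicating."""
    rank = dist.get_rank()
    for name, param in model.named_parameters():
        if param.grad is not None:
            # The sum of a gradient tensor is a cheap fingerprint to compare.
            print(f"rank {rank} | {name} | grad sum = {param.grad.sum().item():.6f}")
            break  # one parameter is usually enough to spot a missing sync

# Usage inside the training loop (sketch):
#   loss.backward()
#   check_grad_sync(model)   # identical sums across ranks => gradients are synced
#   optimizer.step()
```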

Additionally, your repo has an `average_gradients` arg. If you turn it on as an explicit (duplicate) gradient-averaging step, does training reach the same accuracy as a single GPU?
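
For context, I haven't checked how the repo implements `average_gradients`, but a manual averaging step in DDP codebases typically looks like an explicit all-reduce over parameter gradients; the sketch below is an assumption about that common pattern, not the repo's actual code.

```python
# Typical manual gradient averaging (sketch); function name and call site
# are assumptions, not necessarily what the repo does.
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every gradient and divide by the world size, mirroring
    what DDP already does automatically during backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# If DDP's built-in sync works, this extra step is redundant. If accuracy only
# matches single-GPU training when it is enabled, the built-in sync is likely
# not happening.
```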