Different training results using a single GPU, multi-GPU and gradient accumulation on a single GPU

Hello,

I am training the exact same network in 3 settings: 1 GPU (A100), 1 GPU with gradient accumulation (Titan), and 2 GPUs (V100) via DataParallel.

The problem is that I get 3 fairly different training outcomes with 3 fairly different validation results.

The effective batch size is the same in all 3 experiments: for the 1 GPU and 2 GPU settings without gradient accumulation it is 6, and for the single GPU with gradient accumulation it is a batch size of 3 with an accumulation step of 2.

The network uses only group norm layers (no batch norm).
I use the same learning rate for all the experiments.
For the experiment using a single GPU with gradient accumulation I divide the loss by the accumulation steps as suggested here. (Is this really necessary?)
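
Concretely, my accumulation loop looks roughly like this (a minimal, self-contained sketch; the tiny model, data, and optimizer here are only placeholders for my actual setup):

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer, just to make the sketch runnable.
model = nn.Sequential(nn.Linear(10, 16), nn.GroupNorm(4, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 2   # accumulation steps
micro_batch = 3   # per-step batch size -> effective batch size of 6

optimizer.zero_grad()
for i in range(accum_steps):
    x = torch.randn(micro_batch, 10)   # stand-in for a real data loader
    y = torch.randn(micro_batch, 1)
    loss = criterion(model(x), y)
    # Divide by accum_steps so the summed gradients equal the gradient
    # of the mean loss over the full effective batch of 6.
    (loss / accum_steps).backward()

optimizer.step()
optimizer.zero_grad()
```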

I would appreciate any hints on what I might be doing wrong. Shouldn't the training be fairly similar in these 3 cases?

For the experiment using a single GPU with gradient accumulation I divide the loss by the accumulation steps as suggested here. (Is this really necessary?)

The linked post focuses on DistributedDataParallel (DDP), which is different from DataParallel (DP). For DDP, every process/rank/device computes its own local loss, while for DP, the loss is global across all devices. I am not sure if this contributes to the discrepancy you saw.
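
To illustrate the difference (a rough sketch with a toy model, not your actual code):

```python
import torch
import torch.nn as nn

# Toy model just to illustrate where the loss is computed.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()

# DataParallel: a single process. The batch of 6 is split across the GPUs,
# the outputs are gathered back onto the default device, and the loss is
# computed once over the full batch, so it is already a "global" mean.
dp_model = nn.DataParallel(model)
x = torch.randn(6, 10).cuda()
y = torch.randn(6, 1).cuda()
loss = criterion(dp_model(x), y)   # mean over all 6 samples
loss.backward()

# DistributedDataParallel: one process per GPU. Each rank would do
#     loss = criterion(ddp_model(local_x), local_y)   # mean over its 3 local samples
#     loss.backward()                                  # gradients get averaged across ranks
# so every rank only ever sees its own local loss.
```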

The effective batch size of all 3 experiments is the same.

What do you mean by effective batch size? I assume the input batch size used in every iteration should be the same?

Thank you for your reply.

The linked post focuses on DistributedDataParallel (DDP), which is different from DataParallel (DP). For DDP, every process/rank/device computes its own local loss, while for DP, the loss is global across all devices. I am not sure if this contributes to the discrepancy you saw.

It is true that the loss is global across all the GPUs with DP, but I think the argument should still hold for gradient accumulation: I do not divide the loss when I use multiple GPUs, but I should divide it when the gradient accumulation step is > 1. Right?
Because if I run the model on one batch and get a loss and some gradients, then run it on another batch and accumulate its gradients on top of the first ones (since I do not call optimizer.zero_grad() after the first backward()), it is as if the two loss values were summed up, not averaged. Therefore it makes sense to divide each loss by the number of accumulation steps. Is that not true?
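
As a sanity check of that reasoning, this toy comparison (a minimal sketch with a placeholder linear model, not my actual network) gives the same gradients both ways:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x = torch.randn(6, 10)
y = torch.randn(6, 1)

# (a) one batch of 6 with the usual mean loss
model.zero_grad()
criterion(model(x), y).backward()
grad_full = model.weight.grad.clone()

# (b) two micro-batches of 3, each loss divided by the accumulation steps
model.zero_grad()
for xb, yb in ((x[:3], y[:3]), (x[3:], y[3:])):
    (criterion(model(xb), yb) / 2).backward()
grad_accum = model.weight.grad.clone()

print(torch.allclose(grad_full, grad_accum, atol=1e-6))  # True (up to floating-point precision)
```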

What do you mean by effective batch size? I assume the input batch size used in every iteration should be the same?

Yes. Before each optimizer.step(), the same number of training samples is fed to the network.
For example: with one GPU and gradient accumulation, I set the batch size to 3 and the accumulation step to 2 (so 6 samples overall). With a single GPU and no gradient accumulation I set the batch size to 6, and with 2 GPUs I set the overall batch size to 6 (so 3 on each GPU).