DDP on 8 GPUs vs. Single GPU training speed

Hi.
As mentioned in the title, I trained my model in two different environments to compare training speed.
I used the torchvision official resnet50 model with the CIFAR10 dataset, which is small enough to train on a single GPU.
I found that DDP on 8 GPUs is about 2x slower than training on a single GPU.
Is this expected for small models, or am I doing something wrong with DDP?
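For reference, here is a simplified sketch of the kind of DDP setup I mean (the optimizer, hyperparameters, and launch method shown are illustrative placeholders, not necessarily my exact settings); it would be launched with `torchrun --nproc_per_node=8 train.py`:

```python
# Simplified DDP training sketch: resnet50 on CIFAR10, one process per GPU.
import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import transforms

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50(num_classes=10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset)  # shards the data across the 8 processes
    # NOTE: batch_size here is per process; 64 is a placeholder value.
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP allreduces gradients during backward
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```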

+)
I used the same parameters in both environments, including the batch size. Should I increase the batch size 8x for the 8-GPU DDP run? Does that (hopefully always) guarantee accuracy similar to the single-GPU case?

I used the same parameters in both environments, including the batch size. Should I increase the batch size 8x for the 8-GPU DDP run? Does that (hopefully always) guarantee accuracy similar to the single-GPU case?

If this is the per-process batch size, the DDP batch size should actually be 1/8 of the single-GPU (local) batch size, so that DDP and local training collectively process the same number of samples in each iteration.
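For example (illustrative numbers only, assuming a single-GPU batch size of 256):

```python
# Keep the *global* batch size the same across both setups.
local_batch_size = 256                                    # single-GPU baseline (example value)
world_size = 8                                            # number of DDP processes / GPUs
per_process_batch_size = local_batch_size // world_size   # 32 samples per GPU

# With DistributedSampler, each process draws from a disjoint 1/world_size shard,
# so 32 samples/GPU * 8 GPUs = 256 samples per iteration, matching the single-GPU run.
```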

If this is already the global batch size, can you try increasing the batch size for both DDP and local training and see how that changes the perf numbers? DDP runs allreduce on the model gradients, so if the batch size is too small, the communication overhead can overshadow the speedup from parallelizing the computation.
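One rough way to compare is to time a fixed number of iterations for each batch size; something like this sketch (the helper name, warmup, and iteration counts are just illustrative, and it assumes the loader has enough batches):

```python
import time
import torch

def time_iterations(model, loader, optimizer, loss_fn, device, warmup=10, iters=50):
    """Rough seconds-per-iteration estimate; call once per batch size you want to compare."""
    model.train()
    it = iter(loader)
    for i in range(warmup + iters):
        x, y = next(it)
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        if i == warmup:
            torch.cuda.synchronize()  # exclude warmup iterations from the measurement
            start = time.time()
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    torch.cuda.synchronize()  # make sure all queued GPU work has finished
    return (time.time() - start) / iters
```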
