Big accuracy gap with DDP

I ran some experiments to better understand what is happening here.

DOES IT DEPEND ON THE NUMBER OF GPUs?
Results without adapting parameters (3 epochs):
No DDP (1 GPU): 64.77 => 72.21 => 78.21
DDP (1 GPU): 64.77 => 72.21 => 78.21
DDP (2 GPUs): 63.96 => 70.80 => 77.20
DDP (3 GPUs): 58.61 => 65.05 => 75.30
DDP (4 GPUs): 55.38 => 70.27 => 72.65
DDP (5 GPUs): 56.22 => 67.60 => 71.87
DDP (6 GPUs): 50.57 => 63.38 => 71.02
DDP (7 GPUs): 49.85 => 65.46 => 67.44
DDP (8 GPUs): 49.34 => 59.54 => 65.89
So the accuracy does not drop suddenly as soon as I switch to DDP; it degrades gradually as the number of GPUs increases.
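For reference, here is a minimal sketch of the kind of DDP training loop these numbers come from. The model, dataset and hyper-parameter values are placeholders (not my exact script), and it assumes the usual rendezvous environment variables (MASTER_ADDR/MASTER_PORT) are already set, e.g. by torchrun or torch.multiprocessing.spawn:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size, dataset, build_model, batch_size=128, lr=0.1, epochs=3):
    # One process per GPU; rank indexes the GPU this process drives.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(build_model().cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    # Each rank only sees 1/world_size of the dataset per epoch, so the
    # number of optimizer steps per epoch shrinks as world_size grows.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # different shuffling every epoch
        for inputs, targets in loader:
            inputs, targets = inputs.cuda(rank), targets.cuda(rank)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()   # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()
```

With an unchanged per-GPU batch_size, the global batch grows with world_size and there are correspondingly fewer optimizer steps per epoch, which presumably contributes to the gradual degradation above.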

IS IT AN ISSUE OF COMMUNICATION BETWEEN PROCESSES?
Results without the DistributedSampler (i.e. each GPU trains on the whole dataset) (3 epochs):
DDP (3 GPUs): 62.88 => 72.66 => 77.35
DDP (5 GPUs): 64.83 => 72.58 => 78.16
DDP (8 GPUs): 63.53 => 72.76 => 77.96
These results are similar to using 1 GPU, so the issue apparently isn't caused by a communication problem between devices.
To confirm this, I also printed, for each process, the gradients after calling opti.step() (identical on
all processes, as expected) and the validation accuracy (similar on all processes, as expected).
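The gradient check itself was roughly the following helper (check_grads_synced is my own name, not a PyTorch API); it gathers each rank's gradient norm and compares them on rank 0:

```python
import torch
import torch.distributed as dist

def check_grads_synced(model, atol=1e-6):
    # Flatten all gradients on this rank into one vector and take its norm.
    grad_norm = torch.cat(
        [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    ).norm()

    # Gather every rank's norm so rank 0 can compare them.
    norms = [torch.zeros_like(grad_norm) for _ in range(dist.get_world_size())]
    dist.all_gather(norms, grad_norm)

    if dist.get_rank() == 0:
        norms = torch.stack(norms)
        print("grad norms per rank:", norms.tolist())
        assert torch.allclose(norms, norms[0].expand_as(norms), atol=atol), \
            "gradients differ across ranks"
```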

IS IT AN ISSUE OF HYPER-PARAMETERS?
Results when adapting parameters (3 epochs):
With lr = lr * sqrt(world_size) and batch_size = batch_size * world_size:
DDP (8 GPUs): 14.15 => 24.93 => 32.97
With lr = lr * world_size and batch_size = batch_size * world_size:
DDP (8 GPUs): 15.69 => 25.53 => 27.07
With lr = lr * world_size (batch_size unmodified):
DDP (8 GPUs): 45.98 => 55.75 => 67.46
With lr = lr * sqrt(world_size) (batch_size unmodified):
DDP (8 GPUs): 51.98 => 60.27 => 69.02
Note that if I apply lr * sqrt(8) when using 1 GPU I get:
No DDP (1 GPU): 60.44 => 69.09 => 76.56 (worse than the 78.21 baseline)
So lr * sqrt(world_size) with batch_size unmodified seems to be the right approach.
However, it is not the only issue, as the gap is still significant.
I will refer to this adapted lr as the “effective lr” from now on.
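Concretely, the “effective lr” just means scaling the base lr by sqrt(world_size) while leaving the per-GPU batch_size untouched; a minimal sketch, with base_lr as a placeholder for the lr tuned on 1 GPU:

```python
import math
import torch.distributed as dist

def effective_lr(base_lr):
    # Scale the 1-GPU lr by sqrt(world_size); the per-GPU batch_size stays unchanged.
    return base_lr * math.sqrt(dist.get_world_size())

# e.g. on 8 GPUs with base_lr = 0.1: 0.1 * sqrt(8) ≈ 0.283
# optimizer = torch.optim.SGD(ddp_model.parameters(), lr=effective_lr(0.1))
```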

DOES THE ISSUE EXIST WHEN TRAINING FOR MORE EPOCHS?
Results when training for more epochs (50 epochs):
No DDP (1 GPU): 89.85
DDP (8 GPUs): 88.37
DDP (8 GPUs): 90.05 (using effective lr)
So the effective lr seems to fix the issue as the number of epochs increases (I also ran the same test with another model on Imagenette and observed similar behavior).
This behavior becomes very clear when we look at the gap as a function of the number of training epochs.
Accuracy gap per number of epochs (No DDP 1 GPU vs DDP 8 GPUs with effective lr):
EPOCHS | gap in accuracy points (1-GPU acc. / 8-GPU acc.)
3 epochs | 9.19 (78.21/69.02)
4 epochs | 3.15 (80.42/77.27)
5 epochs | 2.04 (82.18/80.14)
6 epochs | 2.10 (83.67/81.57)
7 epochs | 1.28 (84.14/82.86)
8 epochs | 1.02 (84.91/83.89)
9 epochs | 0.67 (85.34/84.67)

SIMILAR DISCUSSIONS
I found similar discussions, but none of them solve my issue:

Increasing the lr helps, but it doesn't seem to be the main issue here since the gap is still significant (see “IS IT AN ISSUE OF HYPER-PARAMETERS?”).

Not exactly the same issue, but it inspired me to try batch_size / world_size:
DDP (8 GPUs): 56.87 => 73.91 => 77.96
However, since there are world_size times more backward calls, an epoch takes about the same time as training on one GPU, so it isn't a viable solution.
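For completeness, this is roughly what that variant looks like (function and argument names are mine): the per-GPU batch is divided by world_size so the global batch matches the single-GPU run, which is exactly why there is no speed-up:

```python
from torch.utils.data import DataLoader, DistributedSampler

def make_loader(dataset, batch_size, world_size, rank):
    # Shrink the per-GPU batch so the global batch equals the 1-GPU batch_size;
    # this means world_size times more optimizer steps than with an unmodified per-GPU batch.
    per_gpu_batch = max(1, batch_size // world_size)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)
```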