Our experience with Kinetics 400 using PyTorch 1.3 on a node with two GPUs is as follows:
Single GPU > DP (-0.2%) > DP w/ sync BN (-0.3%)
Single GPU serves as the baseline for DP and DP w/ sync BN.
The small accuracy tradeoff with data-parallel training is understandable, but sync BN making accuracy worse is harder to dismiss.
My setup is the same as yours, except I am testing on HMDB51, and I see the same pattern:
DP > DP w/ sync BN. Have you found a solution to this issue?
Maybe you are right. With DDP + sync BN, the BN statistics are computed over a larger effective batch, so the learning rate should be tuned a bit higher (original_lr * num_gpus).
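A minimal sketch of what that setup could look like (launched with torchrun, or torch.distributed.launch on older versions); the model, base_lr value, and layer sizes are placeholders, not taken from this thread:

```python
# Sketch: DDP + SyncBatchNorm with the linear LR scaling rule (original_lr * num_gpus).
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()  # single-node assumption
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU()).cuda()
# Convert ordinary BN layers so statistics are computed over the global batch.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

base_lr = 0.01  # LR tuned for single-GPU training (assumed value)
world_size = dist.get_world_size()
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * world_size, momentum=0.9)
```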
Not sure this is the case for you, but I was using autocast and GradScaler with both set to enabled=False. According to the docs that should mean they have no effect, and that was indeed the case with a single GPU and with DP.
However, with DDP, introducing them significantly increased the variance of the training and validation loss and hurt model accuracy overall. According to the docs, autocast and GradScaler shouldn't adversely affect DDP, but they did in my case. I'm not sure why, but I assume it has to do with gradient synchronization in DDP.
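For reference, a minimal sketch of how those AMP pieces are typically wired in, with the enabled switch mentioned above; the model, optimizer, and data here are toy placeholders, and with DDP the model would be wrapped in DistributedDataParallel first:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 2).cuda()               # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

use_amp = False                               # enabled=False, expected to be a no-op
scaler = GradScaler(enabled=use_amp)

inputs = torch.randn(8, 10).cuda()
targets = torch.randint(0, 2, (8,)).cuda()

optimizer.zero_grad()
with autocast(enabled=use_amp):
    loss = criterion(model(inputs), targets)
# With enabled=False, scale()/step()/update() should reduce to plain backward()/step().
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```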
Do you use `loss = losses.sum()`? By default, DDP averages gradients over all ranks, which assumes the loss is a mean over the local batch. If a sum-form loss is used, the gradient scale ends up wrong.
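A small sketch of the difference, assuming a toy classification loss; because DDP averages gradients across ranks, a per-rank mean loss reproduces the single-process gradient, while a per-rank sum is implicitly divided by world_size. The rescaling line is only an illustration, not a fix confirmed in this thread:

```python
import torch
import torch.nn as nn

outputs = torch.randn(8, 5, requires_grad=True)   # stand-in logits for one rank
targets = torch.randint(0, 5, (8,))

loss_mean = nn.CrossEntropyLoss(reduction="mean")(outputs, targets)  # safe with DDP
loss_sum = nn.CrossEntropyLoss(reduction="sum")(outputs, targets)    # scale changes under DDP

# If a sum-form loss is really needed, multiplying by the number of ranks would
# undo DDP's gradient averaging (hypothetical rescaling):
# loss = loss_sum * torch.distributed.get_world_size()
```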