Training performance degrades with DistributedDataParallel

Maybe you are right. With DDP + SyncBN, batch norm statistics are computed over the combined batch from all GPUs, so the effective batch size grows with the number of processes. A common heuristic is to scale the learning rate up accordingly (original_lr * num_gpus).
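For reference, here is a minimal sketch (not taken from this repo; the model, base learning rate, and environment handling are placeholder assumptions) showing SyncBatchNorm conversion under DDP plus the linear learning-rate scaling:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK / WORLD_SIZE; init the default process group.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; replace with the actual network.
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # Convert every BatchNorm layer to SyncBatchNorm so statistics are
    # computed over the global batch across all GPUs.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Linear scaling heuristic: effective batch = per-GPU batch * world_size,
    # so scale the single-GPU learning rate by the number of GPUs.
    base_lr = 0.01  # assumed single-GPU learning rate
    world_size = dist.get_world_size()
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=<num_gpus> train.py`, each process gets its own shard of the batch, and the scaled learning rate compensates for the larger effective batch.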