Training performance degrades with DistributedDataParallel

Maybe you are right. With DDP + SyncBN, batch norm statistics are computed over the combined batch from all GPUs, so the effective batch size grows with the number of processes. A common heuristic is to scale the learning rate up accordingly (original_lr * num_gpus).
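For reference, here is a minimal sketch (not taken from this repo; the model, base learning rate, and environment handling are placeholder assumptions) showing SyncBatchNorm conversion under DDP plus the linear learning-rate scaling:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK / WORLD_SIZE; init the default process group.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; replace with the actual network.
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # Convert every BatchNorm layer to SyncBatchNorm so statistics are
    # computed over the global batch across all GPUs.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Linear scaling heuristic: effective batch = per-GPU batch * world_size,
    # so scale the single-GPU learning rate by the number of GPUs.
    base_lr = 0.01  # assumed single-GPU learning rate
    world_size = dist.get_world_size()
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=<num_gpus> train.py`, each process gets its own shard of the batch, and the scaled learning rate compensates for the larger effective batch.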