Training performance degrades with DistributedDataParallel

@Mr.Z Did you find the problem? I also get much worse accuracy when using SyncBN + DDP with a batch size of 16 (4 GPUs on one node, 4 images per GPU), but when I use DataParallel + SyncBN, everything is OK.
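
For reference, this is roughly the SyncBN + DDP setup I mean (a minimal sketch; the actual model, dataset, and launch via `torchrun`/`torch.distributed.launch` are assumed, and `local_rank` is whatever your launcher provides):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp_model(model: torch.nn.Module) -> torch.nn.Module:
    # One process per GPU; local_rank comes from the launcher environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Replace every BatchNorm layer with SyncBatchNorm so statistics are
    # computed over the global batch of 16 instead of the per-GPU batch of 4.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(local_rank)

    # Wrap with DDP; each process drives exactly one device.
    return DDP(model, device_ids=[local_rank])
```

With DataParallel + SyncBN on a single process I see normal accuracy, so the difference seems to come from the DDP side of this setup rather than from SyncBN itself.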