Hi, I am using apex and multi-node multi-gpu training.
I wonder what the recommended way is to set up sync_bn across nodes/cards.
In NVIDIA's official apex ImageNet example, it uses apex.parallel.convert_syncbn_model + apex.parallel.DistributedDataParallel.
In torchvision's official video-classification example, it uses torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel.
What's the difference, and the recommended usage for distributed training?
We recommend using the native implementations for mixed-precision training as well as the
DistributedDataParallel utilities, as the apex implementations were a proof of concept and the native implementations will continue to be improved.
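For reference, a minimal sketch of the native mixed-precision loop (torch.cuda.amp). The model, optimizer, and loss here are illustrative placeholders, and amp is disabled automatically when no CUDA device is present so the snippet also runs on CPU:

```python
import torch

# Pick a device; amp only takes effect on CUDA, so disable it otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = torch.nn.Linear(10, 2).to(device)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    x = torch.randn(4, 10, device=device)
    optimizer.zero_grad()
    # Run the forward pass in mixed precision (fp16 on CUDA).
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(x).pow(2).mean()
    # Scale the loss to avoid fp16 gradient underflow; this is a
    # no-op pass-through when amp is disabled.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

In a distributed run, the same loop works unchanged after wrapping `model` in DistributedDataParallel.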
Hi, does "native" mean that we should use "torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel"?
Yes, by "native" I mean the classes and functions from the
torch namespace, not the (older and now deprecated) apex utilities.
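A minimal sketch of the native recipe above. The conversion step walks the module tree and replaces every BatchNorm*d layer with SyncBatchNorm; it does not require an initialized process group, so it can be checked offline (the model here is an illustrative placeholder):

```python
import torch

# Placeholder model containing a BatchNorm layer to convert.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)

# Replace BatchNorm2d with SyncBatchNorm so batch statistics are
# all-reduced across processes once a process group is running.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm

# In the actual training script, after
# torch.distributed.init_process_group(...), you would then wrap it:
# model = torch.nn.parallel.DistributedDataParallel(
#     model.cuda(local_rank), device_ids=[local_rank])
```

Sync only happens during training when the process group has more than one member; in a single process SyncBatchNorm behaves like a regular BatchNorm.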