Correct way to use sync batch norm with apex and DDP

Hi, I am using apex for multi-node, multi-GPU training.
I wonder what the recommended way is to set up sync_bn across nodes/cards.

In Nvidia’s official apex Imagenet example, it uses apex.parallel.convert_syncbn_model()

In torchvision’s official video-classification example, it uses torch.nn.SyncBatchNorm.convert_sync_batchnorm

What’s the difference, and what is the recommended usage for distributed training?

We recommend using the native implementations for mixed-precision training as well as the DistributedDataParallel utilities, as the apex implementations were a proof of concept and the native implementations will be improved in the future.

Hi, does “native” mean that we should use “torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel”? :thinking:

Yes, by “native” I mean the classes and functions from the torch namespace, not the (older and now deprecated) apex implementations.