Hi, I am using apex and multi-node multi-gpu training.
I wonder what the recommended way is to set up sync_bn across nodes/cards.
In NVIDIA's official apex ImageNet example, it uses apex.parallel.convert_syncbn_model + apex.parallel.DistributedDataParallel.
In torchvision's official video-classification example, it uses torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel.
What's the difference, and the recommended usage for distributed training?
We recommend using the native implementations for mixed-precision training as well as the
DistributedDataParallel utilities, as the apex implementations were a proof of concept and the native implementations will continue to be improved.
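For reference, a minimal sketch of the native mixed-precision loop (torch.cuda.amp). The model, optimizer, and loss here are illustrative placeholders, and amp is disabled automatically when no CUDA device is present so the snippet also runs on CPU:

```python
import torch

# Pick a device; amp only takes effect on CUDA, so disable it otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = torch.nn.Linear(10, 2).to(device)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    x = torch.randn(4, 10, device=device)
    optimizer.zero_grad()
    # Run the forward pass in mixed precision (fp16 on CUDA).
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(x).pow(2).mean()
    # Scale the loss to avoid fp16 gradient underflow; this is a
    # no-op pass-through when amp is disabled.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

In a distributed run, the same loop works unchanged after wrapping `model` in DistributedDataParallel.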
Hi, does "native" mean that we should use "torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel"?
Yes, by "native" I mean the classes and functions from the
torch namespace, not the (older and now deprecated) apex utilities.
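A minimal sketch of the native recipe above. The conversion step walks the module tree and replaces every BatchNorm*d layer with SyncBatchNorm; it does not require an initialized process group, so it can be checked offline (the model here is an illustrative placeholder):

```python
import torch

# Placeholder model containing a BatchNorm layer to convert.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)

# Replace BatchNorm2d with SyncBatchNorm so batch statistics are
# all-reduced across processes once a process group is running.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm

# In the actual training script, after
# torch.distributed.init_process_group(...), you would then wrap it:
# model = torch.nn.parallel.DistributedDataParallel(
#     model.cuda(local_rank), device_ids=[local_rank])
```

Sync only happens during training when the process group has more than one member; in a single process SyncBatchNorm behaves like a regular BatchNorm.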