Hi, I am using apex and multi-node multi-gpu training.
I wonder what the recommended way is to set up sync_bn across nodes/cards.
In NVIDIA's official apex ImageNet example, it uses apex.parallel.convert_syncbn_model + apex.parallel.DistributedDataParallel.
In torchvision's official video-classification example, it uses torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel.
What's the difference, and the recommended usage for distributed training?
We recommend using the native implementations for mixed-precision training as well as the
DistributedDataParallel utilities, as the apex implementations were a proof of concept and the native implementations will continue to be improved.
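For reference, a minimal sketch of the native mixed-precision loop (torch.cuda.amp). The model, optimizer, and loss here are illustrative placeholders, and amp is disabled automatically when no CUDA device is present so the snippet also runs on CPU:

```python
import torch

# Pick a device; amp only takes effect on CUDA, so disable it otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = torch.nn.Linear(10, 2).to(device)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    x = torch.randn(4, 10, device=device)
    optimizer.zero_grad()
    # Run the forward pass in mixed precision (fp16 on CUDA).
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(x).pow(2).mean()
    # Scale the loss to avoid fp16 gradient underflow; this is a
    # no-op pass-through when amp is disabled.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

In a distributed run, the same loop works unchanged after wrapping `model` in DistributedDataParallel.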
Hi, does "native" mean that we should use "torch.nn.SyncBatchNorm.convert_sync_batchnorm + torch.nn.parallel.DistributedDataParallel"?
Yes, by "native" I mean the classes and functions from the
torch namespace, not the (older and now deprecated) apex utilities.
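A minimal sketch of the native recipe above. The conversion step walks the module tree and replaces every BatchNorm*d layer with SyncBatchNorm; it does not require an initialized process group, so it can be checked offline (the model here is an illustrative placeholder):

```python
import torch

# Placeholder model containing a BatchNorm layer to convert.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)

# Replace BatchNorm2d with SyncBatchNorm so batch statistics are
# all-reduced across processes once a process group is running.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm

# In the actual training script, after
# torch.distributed.init_process_group(...), you would then wrap it:
# model = torch.nn.parallel.DistributedDataParallel(
#     model.cuda(local_rank), device_ids=[local_rank])
```

Sync only happens during training when the process group has more than one member; in a single process SyncBatchNorm behaves like a regular BatchNorm.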