Native Distributed Data Parallel with Mixed Precision

I was wondering if gradients are cast from FP16 to FP32 before being all-reduced, as with the Apex DistributedDataParallel default flag allreduce_always_fp32.

Also, what is the equivalent of delay_allreduce in torch.nn.parallel.DistributedDataParallel?

I wouldn’t expect any automatic transformation of the gradients unless DDP communication hooks etc. are used.
The native mixed-precision training utility, torch.cuda.amp, keeps the parameters in FP32 and therefore also produces FP32 gradients.
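Here is a minimal sketch of a native AMP + DDP training step that illustrates this (assuming a process group launched e.g. via torchrun; the model, batch shapes, and learning rate are placeholders):

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumed launch via torchrun, which sets LOCAL_RANK etc.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.distributed.init_process_group("nccl")
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024).cuda(local_rank)  # parameters are FP32
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(32, 1024, device=local_rank)
target = torch.randn(32, 1024, device=local_rank)

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    # forward ops run in lower precision where safe,
    # but the stored parameters stay FP32
    loss = nn.functional.mse_loss(ddp_model(data), target)
scaler.scale(loss).backward()  # gradients are FP32 (scaled) and all-reduced as such
scaler.step(optimizer)
scaler.update()

print(next(ddp_model.parameters()).dtype)       # torch.float32
print(next(ddp_model.parameters()).grad.dtype)  # torch.float32
```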

You could try to increase bucket_cap_mb to kick off the gradient synchronization later, if needed (I'm unsure if there is a cleaner way); see the sketch below.
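As a rough sketch (the value 250 is just for illustration and would need tuning), larger buckets mean each all-reduce is triggered later in the backward pass:

```python
# assuming `model` and `local_rank` are set up as in the previous snippet
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=250,  # default is 25 MB; larger buckets delay the per-bucket all-reduce
)
```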
