I was wondering if gradients are cast from FP16 to FP32 before the all-reduce, as with the allreduce_always_fp32 flag of Apex's DistributedDataParallel.
Also, what is the equivalent of delay_allreduce in torch.nn.parallel.DistributedDataParallel?
I wouldn’t expect any automatic transformation of the gradients unless DDP communication hooks (or something similar) are used.
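If you do want behavior similar to Apex's allreduce_always_fp32, a custom DDP communication hook can upcast each gradient bucket to FP32 before reducing it and cast back afterwards. A minimal sketch, assuming a recent PyTorch with the Future-based hook API; the hook name, the single-process gloo group, and the port are illustrative only:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def fp32_allreduce_hook(process_group, bucket):
    # Hypothetical hook: upcast the flattened gradient bucket to FP32,
    # all-reduce it, average, then cast back to the bucket's dtype.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()
    buf = bucket.buffer().to(torch.float32).div_(world_size)
    fut = dist.all_reduce(buf, group=group, async_op=True).get_future()

    def cast_back(fut):
        return fut.value()[0].to(bucket.buffer().dtype)

    return fut.then(cast_back)

# Single-process gloo group, only to demonstrate registering the hook.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 8))
model.register_comm_hook(None, fp32_allreduce_hook)
model(torch.randn(2, 8)).sum().backward()
grad_dtype = model.module.weight.grad.dtype  # grads of FP32 params stay FP32
dist.destroy_process_group()
```

In a real multi-GPU run the hook would be registered the same way, just with the NCCL backend and one process per device.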
The native mixed-precision training utility, torch.cuda.amp, uses FP32 parameters and thus also FP32 gradients.
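You can verify this directly: under autocast only the activations are computed in lower precision, while the parameters and their gradients stay FP32. A short sketch (CPU autocast with bfloat16 is used here only so it runs without a GPU; torch.cuda.amp.autocast behaves the same way for FP32 CUDA parameters):

```python
import torch

model = torch.nn.Linear(4, 4)            # parameters are created in FP32
assert model.weight.dtype == torch.float32

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(torch.randn(2, 4))       # activations are cast down by autocast
assert out.dtype == torch.bfloat16

out.float().sum().backward()
# gradients are produced in the dtype of the (FP32) parameters
assert model.weight.grad.dtype == torch.float32
```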
You could try to increase bucket_cap_mb to kick off the gradient sync later, if needed (I'm unsure if there is a cleaner way).
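A larger bucket_cap_mb (the default is 25 MB) means fewer, bigger gradient buckets, so each bucket fills up later during the backward pass and its all-reduce starts later. A sketch, again with an illustrative single-process gloo group and port:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

# With a 512 MB cap, small models end up with a single bucket,
# so the gradient all-reduce only fires once backward has finished.
model = DDP(torch.nn.Linear(16, 16), bucket_cap_mb=512)
model(torch.randn(4, 16)).sum().backward()
grad = model.module.weight.grad
dist.destroy_process_group()
```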