Hello,
I tried to reduce the communication bottleneck by using automatic mixed-precision training, and I used torch.profiler to check the training performance.
In the timeline, I noticed that the all-reduce kernel is ncclAllReduceRingLLKernel_sum_f32.
Does this mean the fp16 gradients are cast to fp32 before being all-reduced? Does something like ncclAllReduceRingLLKernel_sum_f16 exist?
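For reference, this is roughly how I inspect the kernels. This is a minimal CPU-only sketch (the model and shapes are placeholders); on GPU with DDP you would add ProfilerActivity.CUDA and look for the nccl* kernel names in the table or the exported trace:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)

# Profile one forward/backward pass; with CUDA you would also pass
# ProfilerActivity.CUDA and call prof.export_chrome_trace("trace.json")
# to see the all-reduce kernels on the timeline.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```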
With the automatic mixed-precision utility the parameters are stored in float32, and so are their gradients; NCCL therefore communicates them in float32 as well. If you call .half() on the model directly, i.e. apply pure float16 training, NCCL should communicate in float16.
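You can verify this directly. A small sketch (using CPU autocast with bfloat16 here so it runs without a GPU; with torch.cuda.amp.autocast on a GPU the parameter and gradient dtypes behave the same way):

```python
import torch

model = torch.nn.Linear(4, 4)  # parameters are float32 by default
x = torch.randn(2, 4)

# Under autocast, eligible ops run their *computation* in lower precision,
# but the parameters (and therefore their gradients) stay float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
out.float().sum().backward()

print(model.weight.dtype)       # torch.float32
print(model.weight.grad.dtype)  # torch.float32

# Calling .half() instead converts the parameters themselves to float16,
# so the gradients (and any all-reduce over them) would be float16 too.
model_fp16 = torch.nn.Linear(4, 4).half()
print(model_fp16.weight.dtype)  # torch.float16
```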
If I use torch.cuda.amp.autocast for mixed-precision training and also call model.half() to ensure that NCCL communicates in float16, do I need to call float() manually on some layers (like BatchNorm)?
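Something like the following is what I had in mind. half_except_norm is a hypothetical helper (not a PyTorch API): it halves the model but converts BatchNorm layers back to float32, since their running statistics are numerically fragile in fp16; in a real forward pass you would additionally need to cast activations at the layer boundary, as Apex's keep_batchnorm_fp32 option does:

```python
import torch

def half_except_norm(model):
    # Hypothetical helper: convert everything to fp16, then restore
    # BatchNorm layers to float32 for numerical stability.
    model.half()
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d,
                          torch.nn.BatchNorm2d,
                          torch.nn.BatchNorm3d)):
            m.float()
    return model

net = half_except_norm(torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.BatchNorm1d(8),
))
print(net[0].weight.dtype)  # torch.float16
print(net[1].weight.dtype)  # torch.float32
```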
Hi all,
I am confused here: the FP32 parameters are indeed stored as the master copy, but why aren't the gradients stored and transferred as FP16 to reduce the load, as the NVIDIA Apex philosophy describes? Is there a tradeoff in the implementation and design of torch autocast?
Any feedback will be appreciated!
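For what it's worth, DDP does expose a communication hook that compresses gradients to fp16 just for the all-reduce and casts them back afterwards, which matches the Apex-style tradeoff described above. A minimal single-process sketch (using a one-rank "gloo" group only so it runs locally; the addresses and port are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, just to illustrate hook registration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 8))
# Gradients are cast to float16 for the all-reduce only; the parameters
# and the stored .grad remain float32.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

model(torch.randn(4, 8)).sum().backward()
print(model.module.weight.grad.dtype)  # torch.float32

dist.destroy_process_group()
```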