Does NCCL allreduce use fp16?

I tried to reduce the communication bottleneck by using automatic mixed-precision training, and I used torch.profiler to check the training performance.

In the timeline, I noticed that the all-reduce kernel is ncclAllReduceRingLLKernel_sum_f32.

Does that mean the fp16 gradients are cast to fp32 before being all-reduced? Does something like ncclAllReduceRingLLKernel_sum_f16 exist?

My GPU is a P100, and my PyTorch version is 1.9.1+cu111.

With the automatic mixed-precision utility the parameters are stored in float32, and thus so are the gradients. NCCL therefore communicates them in float32, too. If you call .half() on the model directly and thus apply pure float16 training, NCCL should communicate in float16.
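A minimal sketch of the point above (runs on CPU; assumes torch >= 1.10, which added the generic torch.autocast API, and uses the CPU/bfloat16 autocast variant purely so it runs without a GPU): autocast lowers the precision of selected ops' activations, but the parameters remain float32 master copies, so their gradients, which are the buffers NCCL all-reduces, stay float32 as well.

```python
import torch

model = torch.nn.Linear(4, 2)  # parameters are float32 (master copy)

# autocast only affects the dtype of selected ops' *activations*;
# CPU/bfloat16 is used here so the sketch runs without a GPU
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(torch.randn(3, 4))  # activation comes out in bfloat16

out.float().sum().backward()
grad_dtypes = {p.grad.dtype for p in model.parameters()}
print(out.dtype, grad_dtypes)  # torch.bfloat16 {torch.float32}
```

The autocast-inserted casts have their own backward, so gradients are cast back to the parameters' float32 dtype before being accumulated.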

Thanks ptrblck! I’ll try half().

Sure, but be careful about training stability, as amp makes sure that float32 is used in the layers where it's needed.

If I use torch.cuda.amp.autocast for mixed-precision training and call model.half() to ensure that NCCL communicates in float16, do I need to call float() manually on some layers (like BatchNorm)?

autocast doesn’t expect a manual .half() call, so you should use either amp or your own manual mixed-precision recipe, not both.
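As an illustration of what such a manual recipe would involve (this is not something autocast does for you): `half_except_bn` below is a hypothetical helper, not a PyTorch API, that casts a model to float16 while restoring the BatchNorm layers to float32. Note that inputs flowing into those float32 layers would also have to be cast by hand, which is part of what apex's O2 opt_level automated.

```python
import torch

def half_except_bn(model):
    # Hypothetical helper (not a PyTorch API): cast everything to float16,
    # then restore the normalization layers to float32 for stability.
    # Activations entering these float32 layers must be cast manually.
    model.half()
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.float()
    return model

model = half_except_bn(torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
))
print(model[0].weight.dtype, model[1].weight.dtype)
# torch.float16 torch.float32
```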


Hi all,
I am confused here: the FP32 parameters are indeed stored as the master copy, but why don't the gradients appear as FP16 for storage and communication to reduce the load, as the NVIDIA Apex philosophy describes? Is there a tradeoff in the implementation and design of torch autocast?
Any feedback will be appreciated!

torch.cuda.amp is comparable to the deprecated O1 opt_level in apex, not O2, as the former was considered to be more stable.
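For completeness, a small sketch of why pure float16 training changes the NCCL kernel: once the parameters themselves are float16 (as after model.half(), roughly the O2-style setup), their gradients are float16 too, so DDP would hand float16 buffers to NCCL and the *_f16 allreduce kernel would appear instead. An elementwise op on a float16 tensor stands in for a model so the sketch runs on CPU.

```python
import torch

# Pure float16 "model": the parameter is float16, so its gradient is too;
# these float16 gradient buffers are what NCCL would all-reduce.
w = torch.randn(4, dtype=torch.float16, requires_grad=True)
loss = (w * 2).sum()
loss.backward()
print(w.grad.dtype)  # torch.float16
```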