Hello,
I tried to reduce the communication bottleneck by using automatic mixed-precision training, and I used torch.profiler to check the training performance.
In the timeline, I noticed that the all-reduce kernel is ncclAllReduceRingLLKernel_sum_f32.
Does this mean the fp16 gradients are cast to fp32 before being all-reduced? Does something like ncclAllReduceRingLLKernel_sum_f16 exist?
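For reference, this is roughly how I inspect the kernels. This is a minimal CPU-only sketch (the model and shapes are placeholders); on GPU with DDP you would add ProfilerActivity.CUDA and look for the nccl* kernel names in the table or the exported trace:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)

# Profile one forward/backward pass; with CUDA you would also pass
# ProfilerActivity.CUDA and call prof.export_chrome_trace("trace.json")
# to see the all-reduce kernels on the timeline.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```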
With the automatic mixed-precision utility the parameters are stored in float32, and so are their gradients; NCCL therefore communicates them in float32 as well. If you call .half() on the model directly, i.e. apply pure float16 training, NCCL should communicate in float16.
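You can verify this directly. A small sketch (using CPU autocast with bfloat16 here so it runs without a GPU; with torch.cuda.amp.autocast on a GPU the parameter and gradient dtypes behave the same way):

```python
import torch

model = torch.nn.Linear(4, 4)  # parameters are float32 by default
x = torch.randn(2, 4)

# Under autocast, eligible ops run their *computation* in lower precision,
# but the parameters (and therefore their gradients) stay float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
out.float().sum().backward()

print(model.weight.dtype)       # torch.float32
print(model.weight.grad.dtype)  # torch.float32

# Calling .half() instead converts the parameters themselves to float16,
# so the gradients (and any all-reduce over them) would be float16 too.
model_fp16 = torch.nn.Linear(4, 4).half()
print(model_fp16.weight.dtype)  # torch.float16
```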
If I use torch.cuda.amp.autocast for mixed-precision training and also call model.half() to ensure that NCCL communicates in float16, do I need to call float() manually on some layers (like BatchNorm)?
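Something like the following is what I had in mind. half_except_norm is a hypothetical helper (not a PyTorch API): it halves the model but converts BatchNorm layers back to float32, since their running statistics are numerically fragile in fp16; in a real forward pass you would additionally need to cast activations at the layer boundary, as Apex's keep_batchnorm_fp32 option does:

```python
import torch

def half_except_norm(model):
    # Hypothetical helper: convert everything to fp16, then restore
    # BatchNorm layers to float32 for numerical stability.
    model.half()
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d,
                          torch.nn.BatchNorm2d,
                          torch.nn.BatchNorm3d)):
            m.float()
    return model

net = half_except_norm(torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.BatchNorm1d(8),
))
print(net[0].weight.dtype)  # torch.float16
print(net[1].weight.dtype)  # torch.float32
```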
Hi all,
I am confused here: the FP32 parameters are indeed stored as the master copy, but why aren't the gradients stored and transferred as FP16 to reduce the load, as the NVIDIA Apex philosophy describes? Is there a tradeoff in the implementation and design of torch autocast?
Any feedback will be appreciated!
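For what it's worth, DDP does expose a communication hook that compresses gradients to fp16 just for the all-reduce and casts them back afterwards, which matches the Apex-style tradeoff described above. A minimal single-process sketch (using a one-rank "gloo" group only so it runs locally; the addresses and port are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, just to illustrate hook registration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 8))
# Gradients are cast to float16 for the all-reduce only; the parameters
# and the stored .grad remain float32.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

model(torch.randn(4, 8)).sum().backward()
print(model.module.weight.grad.dtype)  # torch.float32

dist.destroy_process_group()
```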