fp16_compress_hook casts to FP16 before dividing by world_size, causing NaN with large gradients

In torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook (PyTorch 2.4.0+cu121), the compression is implemented as:

compressed_tensor = buffer.to(dtype).div_(world_size)

This casts to FP16 before dividing by world_size. Any FP32 gradient value exceeding 65504 (the FP16 max) overflows to inf during the cast, and inf / world_size remains inf. During the AllReduce sum, a +inf from one rank meeting a -inf from another produces NaN, and same-sign infs stay inf; either way, the reduced gradients are corrupted.

A safer approach would be to divide in FP32 first, then cast:

compressed_tensor = buffer.div(world_size).to(dtype)

This way, only gradients whose average exceeds the FP16 range would overflow, which is far less likely in practice.
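The effect of the ordering is easy to reproduce in isolation. A minimal sketch (the value and world size are chosen for illustration):

```python
import torch

world_size = 4
grad = torch.tensor([131072.0])  # FP32 value above the FP16 max (65504)

# Current ordering: cast to FP16 first, then divide.
cast_first = grad.to(torch.float16).div_(world_size)
print(cast_first)  # tensor([inf], dtype=torch.float16)

# Reordered: divide in FP32 first, then cast.
divide_first = grad.div(world_size).to(torch.float16)
print(divide_first)  # tensor([32768.], dtype=torch.float16)
```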

Is this a known issue, or is the current ordering intentional for some reason I’m missing? (e.g., memory considerations with gradient_as_bucket_view=True?)

For now, I’m using a custom comm hook with the division reordered, registered via model.register_comm_hook().