In torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook (PyTorch 2.4.0+cu121), the compression is implemented as:
```python
compressed_tensor = buffer.to(dtype).div_(world_size)
```
This casts to FP16 before dividing by world_size. Any FP32 gradient value whose magnitude exceeds 65504 (the FP16 max) overflows to ±inf during the cast, and inf / N is still inf. The inf then enters the AllReduce sum; same-sign infs stay inf, and wherever +inf and -inf meet, the sum produces NaN. Either way, the averaged gradient is corrupted.
A safer approach would be to divide in FP32 first, then cast:
```python
compressed_tensor = buffer.div(world_size).to(dtype)
```
This way, only gradients whose *average* exceeds the FP16 range would overflow, which is far less likely in practice than a single pre-division value exceeding it.
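A quick numeric check of both orderings (values chosen for illustration: 80000 exceeds the FP16 max of 65504, and 80000 / 4 = 20000 is exactly representable in FP16):

```python
import torch

world_size = 4
grad = torch.tensor([80000.0], dtype=torch.float32)  # exceeds FP16 max (65504)

# Current ordering: cast first, then divide -> the overflow sticks
cast_first = grad.to(torch.float16).div_(world_size)
print(cast_first)  # tensor([inf], dtype=torch.float16)

# Proposed ordering: divide in FP32 first, then cast -> stays finite
div_first = grad.div(world_size).to(torch.float16)
print(div_first)  # tensor([20000.], dtype=torch.float16)

# And once inf enters the AllReduce, opposite signs produce NaN
print(torch.tensor(float("inf")) + torch.tensor(float("-inf")))  # tensor(nan)
```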
Is this a known issue, or is the current ordering intentional for some reason I’m missing? (e.g., memory considerations with gradient_as_bucket_view=True?)
For now, I’m using a custom comm hook with the division reordered, registered via model.register_comm_hook().
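For reference, a minimal sketch of what such a hook can look like, modeled on the structure of the built-in default hooks (the name `fp16_divide_first_hook` and this exact implementation are mine, not PyTorch's):

```python
import torch
import torch.distributed as dist

def fp16_divide_first_hook(process_group, bucket):
    """Like fp16_compress_hook, but divides in FP32 before casting to FP16."""
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()

    buffer = bucket.buffer()
    # Divide in FP32 first, then cast: only averages outside FP16 range overflow.
    compressed = buffer.div(world_size).to(torch.float16)

    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # Copy the reduced FP16 result back into the FP32 bucket buffer,
        # preserving gradient_as_bucket_view semantics.
        buffer.copy_(fut.value()[0])
        return buffer

    return fut.then(decompress)

# Registered on the DDP-wrapped model:
# ddp_model.register_comm_hook(state=None, hook=fp16_divide_first_hook)
```

Note one possible reason for the current ordering: `buffer.to(dtype).div_(world_size)` divides in place on the FP16 copy, while dividing first with `.div()` materializes an extra FP32 temporary the size of the bucket.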