Overflow on CPU, but not GPU

```python
import torch
torch.manual_seed(0)
x = torch.ones(1000000).half()

print(x.mean())
print(x.cuda().mean())
```

Output:

```
tensor(nan, dtype=torch.float16)
tensor(1., device='cuda:0', dtype=torch.float16)
```

Why?

I guess the CPU reduction kernel accumulates directly in float16, so the running sum overflows (float16's maximum value is 65504, far below 1,000,000), while the GPU kernel uses float32 for the intermediate values, since mixed-precision training with float16 is mostly a GPU workflow. If I'm not mistaken, bfloat16 is the preferred reduced-precision format on the CPU, but let's wait for others to chime in and correct me.
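The accumulator-precision effect is easy to reproduce outside of PyTorch. A minimal sketch with NumPy (assuming only that NumPy's `sum` accumulates in the input dtype unless told otherwise): summing a million float16 ones in float16 overflows past 65504, while the same sum with a float32 accumulator is exact.

```python
import numpy as np

x = np.ones(1_000_000, dtype=np.float16)

# Accumulating in float16: the partial sums exceed float16's
# maximum value (65504), so the result is no longer 1,000,000.
print(x.sum())

# Accumulating in float32: 1,000,000 is exactly representable
# in float32, so the sum comes out exact.
print(x.sum(dtype=np.float32))

# A float32 accumulator likewise recovers the expected mean of 1.0.
print(x.mean(dtype=np.float32))
```

In PyTorch the analogous workaround would be `x.mean(dtype=torch.float32)` or `x.float().mean()`, which performs the reduction in float32 on the CPU as well.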