Bfloat16 on NVIDIA V100 GPU

Hello everyone!
It is said that bfloat16 is only supported on GPUs with compute capability of at least 8.0, which means the NVIDIA V100 (compute capability 7.0) should not support bfloat16.

But I have tested the code below on a V100 machine and it ran successfully.


import torch

a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
c = torch.matmul(a, b)
print(c.dtype)
print(c.device)

and got this result:

torch.bfloat16
cuda:0

But when I run print(torch.cuda.is_bf16_supported()), I get False.
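
For context, here is a minimal way to query what the device reports (the comments show what I would expect on a V100, as an assumption, not output I captured):

import torch

print(torch.cuda.get_device_name(0))
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # a V100 should report 7.0
print(torch.cuda.is_bf16_supported())          # False on pre-Ampere GPUs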

So what is the situation here?

Creating tensors with bfloat16 might be supported on older architectures, but the actual compute kernels would not be.

So does that mean that although the tensor dtype is bfloat16, the computation actually runs in fp32 on the GPU?

Yes, older hardware which does not support bfloat16 compute will emulate it via float32 compute.
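
A rough sketch of what that emulation amounts to; this is just an illustration of the upcast-compute-downcast pattern, not the actual kernel code:

import torch

a = torch.randn(128, 128, dtype=torch.bfloat16, device="cuda")
b = torch.randn(128, 128, dtype=torch.bfloat16, device="cuda")

# the bf16 matmul as PyTorch dispatches it on this hardware
c = torch.matmul(a, b)

# explicit emulation: upcast to fp32, compute, round back to bf16
c_emulated = torch.matmul(a.float(), b.float()).to(torch.bfloat16)

# the two should agree closely; a loose tolerance allows for
# accumulation-order and rounding differences
print(torch.allclose(c.float(), c_emulated.float(), atol=1e-1))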

Thanks for your answer!
I have also tried bfloat16 mixed-precision training on the V100, and the time cost is almost the same as full fp32 training (even a little slower).
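
For reference, the loop I mean follows the usual autocast pattern; this is a minimal sketch with a toy model and made-up sizes, not my real training code:

import torch
import torch.nn.functional as F

# toy model and data, just to illustrate the autocast pattern
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

# bf16 autocast; unlike fp16, no GradScaler is needed because bf16
# keeps the same exponent range as fp32
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = F.mse_loss(out, target)

loss.backward()
opt.step()
opt.zero_grad()

On a V100 this gives no speedup, which is consistent with the explanation above: the bf16 ops are emulated in fp32, so you pay the fp32 compute cost plus the casting overhead.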