Bfloat16 on NVIDIA V100 GPU

Hello everyone!
It is said that bfloat16 is only supported on GPUs with compute capability of at least 8.0, which means the NVIDIA V100 (compute capability 7.0) should not support bfloat16.

But I have tested the code below on a V100 machine and it runs successfully.


import torch

# create two bfloat16 tensors on the GPU and multiply them
a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
c = torch.matmul(a, b)
print(c.dtype)
print(c.device)

and got this result:

torch.bfloat16
cuda:0

But when I run print(torch.cuda.is_bf16_supported()), I get False.

So what is the situation here?

Creating tensors with bfloat16 might be supported on older architectures, but the actual compute kernels would not be.

So does that mean that although the dtype is bfloat16, the computation actually runs in fp32 on the GPU?

Yes, older hardware which does not support bfloat16 compute will emulate it via float32 compute.
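
If it helps to picture it, here is a rough conceptual sketch of what that emulation amounts to (this is not the actual PyTorch kernel code, just an illustration, and emulated_bf16_matmul is a hypothetical helper): the inputs are upcast to float32, the math runs in fp32, and the result is cast back down to bfloat16.

import torch

# Conceptual illustration only (hypothetical helper, not PyTorch's actual kernels):
# "emulating BF16 via float32" means upcasting the inputs, doing the math in fp32,
# and casting the result back down to bfloat16.
def emulated_bf16_matmul(a, b):
    assert a.dtype == torch.bfloat16 and b.dtype == torch.bfloat16
    return torch.matmul(a.float(), b.float()).to(torch.bfloat16)

a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
print(emulated_bf16_matmul(a, b).dtype)  # torch.bfloat16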

Thanks for your answer!
I have also tried bfloat16 mixed-precision training on the V100; the time cost is almost the same as full fp32 training (even a little slower).
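
For context, by bf16 mixed precision I mean something along these lines (a minimal sketch with a placeholder model and optimizer, not my actual training code):

import torch

# Minimal sketch of a bf16 mixed-precision step (placeholder model/optimizer,
# not the real training setup).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    # autocast runs eligible ops in bfloat16; no GradScaler is needed for bf16
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(x)
        loss = torch.nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()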

Any source code pointers to where/how this emulation happens? I’m super curious about what’s going on under the hood.

Hi! Very strange. I am using a DGX V100 and get the following result:

print(torch.cuda.is_bf16_supported())
True
Also, I could train models using bf16 in DeepSpeed.

I am using CUDA 12.4, PyTorch 2.5.0 and NVIDIA driver 550.127.08.

is_bf16_supported() checks whether the CUDA toolkit used to build PyTorch supports BF16 and whether the compute capability of the device is Ampere or newer. However, even if these checks fail, it will still try to allocate a BF16 tensor and report support based on whether that tensor creation succeeds.
IMO, creating a tensor alone is not sufficient to claim BF16 is supported, as math operations could fail (and will in the case of Volta devices).
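
If you want a stricter check than is_bf16_supported(), something along these lines (my own suggestion, not a PyTorch API; bf16_compute_supported is a made-up helper name) would gate on the compute capability instead of on tensor creation:

import torch

# Rough sketch of a stricter check (my own suggestion, not part of PyTorch):
# require an Ampere-or-newer device rather than relying on tensor creation.
def bf16_compute_supported():
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # Ampere (sm_80) and newer have native BF16 compute

print(bf16_compute_supported())  # expected: False on a V100 (sm_70)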