Hello everyone!
It is said that bfloat16 is only supported on GPUs with a compute capability of at least 8.0, which means the NVIDIA V100 should not support bfloat16.
But I tested the code below on a V100 machine and it ran successfully:
import torch

a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
c = torch.matmul(a, b)
print(c.dtype)
print(c.device)
and got the result
torch.bfloat16
cuda:0
But when I run print(torch.cuda.is_bf16_supported()), I get False.
So what is the situation here?
Creating tensors with bfloat16 might be supported on older architectures, but the actual compute kernels would not be.
So does that mean that although the dtype is bfloat16, the computation is actually run in fp32 on the GPU?
Yes, older hardware that does not support bfloat16 compute will emulate it via float32 compute.
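Conceptually, the result you get on a V100 should be very close to explicitly upcasting to fp32, doing the math, and casting back down. This is only an illustration of the numerics, not the actual kernel path:

import torch

# Illustration of the numerics only, not the actual kernel code path on Volta.
a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")

c_native = torch.matmul(a, b)                                       # "bf16" matmul as run on the GPU
c_explicit = torch.matmul(a.float(), b.float()).to(torch.bfloat16)  # explicit fp32 compute path

# The two results should be numerically very close (not necessarily bit-identical).
print(torch.allclose(c_native.float(), c_explicit.float(), rtol=1e-2, atol=1e-2))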
Thanks for your answer!
I have also tried bfloat16 mixed precision training on a V100, and the time cost is almost the same as full fp32 training (even a little slower).
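Here is a minimal sketch of the kind of bf16 autocast loop I mean (toy model for illustration only, not my actual training code):

import torch
import torch.nn as nn

# Toy model, just to illustrate the bf16 autocast pattern.
# Note: unlike fp16, bf16 autocast does not need a GradScaler.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(data)
        loss = nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()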
Any source code pointers to where/how this emulation happens? I’m super curious about what’s going on under the hood.
Hi! Very strange. I am using a DGX V100 and get the following result:
print(torch.cuda.is_bf16_supported())
True
Also, I can train models using bf16 in DeepSpeed.
I am using CUDA 12.4, PyTorch 2.5.0 and NVIDIA driver 550.127.08.
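For comparison, this prints the device name, its compute capability, and the result of the check together:

import torch

# A V100 (Volta) should report compute capability (7, 0);
# Ampere and newer GPUs report (8, 0) or higher.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print(torch.cuda.is_bf16_supported())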
is_bf16_supported() checks whether the CUDA toolkit used to build PyTorch supports BF16, as well as whether the device's compute capability is Ampere+ (8.0 or newer). However, even if these checks fail, the function will try to allocate a BF16 tensor and return support based on whether the tensor creation succeeds.
IMO, creating a tensor alone is not sufficient to claim BF16 is supported, as math operations could fail (and will in the case of Volta devices).
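Roughly, the described check, plus a stricter probe that actually runs a bf16 op, would look like the sketch below (illustrative only, not the actual PyTorch source; the real function also inspects the CUDA version PyTorch was built with):

import torch

def bf16_supported_sketch() -> bool:
    # Rough sketch of the described behavior, not PyTorch's implementation.
    # Ampere (compute capability 8.0) or newer passes the primary check.
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        return True
    # Fallback: report support if a bf16 tensor can at least be allocated.
    try:
        torch.zeros(1, dtype=torch.bfloat16, device="cuda")
        return True
    except RuntimeError:
        return False

def bf16_compute_probe() -> bool:
    # Stricter probe: actually run a small bf16 matmul, since allocation alone
    # does not prove the math kernels exist for the device.
    try:
        x = torch.randn(4, 4, dtype=torch.bfloat16, device="cuda")
        torch.matmul(x, x)
        torch.cuda.synchronize()
        return True
    except RuntimeError:
        return False

print(bf16_supported_sketch(), bf16_compute_probe())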