Bfloat16 on NVIDIA V100 GPU

Hello everyone!
It is said that bfloat16 is only supported on GPUs with compute capability of at least 8.0, which means the NVIDIA V100 (compute capability 7.0) should not support bfloat16.

But I have tested the code below on a V100 machine and it runs successfully.


import torch

# create two bfloat16 tensors on the GPU and multiply them
a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
c = torch.matmul(a, b)
print(c.dtype)
print(c.device)

and got this result:

torch.bfloat16
cuda:0

But when I run print(torch.cuda.is_bf16_supported()), I get False.

So what is the situation here?

Creating tensors with bfloat16 might be supported on older architectures, but the actual compute kernels would not be.

So does that mean that although the dtype is bfloat16, the computation actually runs in fp32 on the GPU?

Yes, older hardware which does not support bfloat16 compute will emulate it via float32 compute.
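
If it helps to picture it, here is a rough conceptual sketch of what that emulation amounts to (this is not the actual PyTorch kernel code, just an illustration, and emulated_bf16_matmul is a hypothetical helper): the inputs are upcast to float32, the math runs in fp32, and the result is cast back down to bfloat16.

import torch

# Conceptual illustration only (hypothetical helper, not PyTorch's actual kernels):
# "emulating BF16 via float32" means upcasting the inputs, doing the math in fp32,
# and casting the result back down to bfloat16.
def emulated_bf16_matmul(a, b):
    assert a.dtype == torch.bfloat16 and b.dtype == torch.bfloat16
    return torch.matmul(a.float(), b.float()).to(torch.bfloat16)

a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
print(emulated_bf16_matmul(a, b).dtype)  # torch.bfloat16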

Thanks for your answer!
I have also tried bfloat16 mixed-precision training on the V100; the time cost is almost the same as full fp32 training (even a little slower).
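
For context, by bf16 mixed precision I mean something along these lines (a minimal sketch with a placeholder model and optimizer, not my actual training code):

import torch

# Minimal sketch of a bf16 mixed-precision step (placeholder model/optimizer,
# not the real training setup).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    # autocast runs eligible ops in bfloat16; no GradScaler is needed for bf16
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(x)
        loss = torch.nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()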

Any source code pointers to where/how this emulation happens? I’m super curious about what’s going on under the hood.

Hi! Very strange. I am using a DGX V100 and get the following result:

print(torch.cuda.is_bf16_supported())
True
Also, I could train models using bf16 in DeepSpeed.

I am using CUDA 12.4, PyTorch 2.5.0 and NVIDIA driver 550.127.08.

is_bf16_supported() checks whether the CUDA toolkit used to build PyTorch supports BF16 and whether the compute capability of the device is Ampere or newer. However, even if these checks fail, it will still try to allocate a BF16 tensor and report support based on whether that tensor creation succeeds.
IMO, creating a tensor alone is not sufficient to claim BF16 is supported, as math operations could fail (and will in the case of Volta devices).
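
If you want a stricter check than is_bf16_supported(), something along these lines (my own suggestion, not a PyTorch API; bf16_compute_supported is a made-up helper name) would gate on the compute capability instead of on tensor creation:

import torch

# Rough sketch of a stricter check (my own suggestion, not part of PyTorch):
# require an Ampere-or-newer device rather than relying on tensor creation.
def bf16_compute_supported():
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # Ampere (sm_80) and newer have native BF16 compute

print(bf16_compute_supported())  # expected: False on a V100 (sm_70)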