Bfloat16 native support

I have a few questions about bfloat16

  1. how can I tell via pytorch if the gpu it’s running on supports bf16 natively? I tried:
$ python -c "import torch; print(torch.tensor(1).cuda().bfloat16().type())"

and it works on any card, whether it’s supported natively or not.

A non-pytorch way will do too. I wasn’t able to find any.

  2. What’s the cost/overhead - how does pytorch handle bf16 on gpus that don’t have native support for it?

e.g. I’m trying to check whether rtx-3090 supports bf16 natively. The information is inconsistent - the Ampere arch supports bf16 but some comments I found suggest that the non-high end cards may have it disabled.

Thank you!

The GA102 whitepaper seems to indicate that the RTX cards do support bf16 natively (in particular p23 where they also state that GA102 doesn’t have fp64 tensor core support in contrast to GA100).

So in my limited understanding there are broadly three ways how PyTorch might use the GPU capabilities:

  • Use backend functions (like cuDNN, cuBLAS) and hopefully they use all the latest and greatest.
  • When using intrinsics directly conventional wisdom (see this 2017 paper discussing half vs. half2 performance) seems to say that bfloat162 will offer better performance over using bfloat16 unless the compiler has learned lots of new tricks. But I am not aware if we actually use that a lot in PyTorch.

Again, I’m still looking into how to get the most fp16 performance, so take this with a grain of salt.

Best regards



Thank you for the links, @tom, and attempting to answer my query.

I read that paper but also read comments on reddit where users suggested some features might have been disabled. I don’t want to propagate rumours about something I don’t know first hand, hence I’m trying to find a practical way to test whether a given gpu has native support for a specific datatype.

I think the hardware whitepaper is probably the most official documentation.
If you wanted to make experiments to verify:

  • You can look at the generated kernels to see if the expected instructions are in there as a first step,
  • you can benchmark versus expectations (either comparing to A100 relative speed-up or to some theoretical promise of relative speed-up).

But both of them rely on you taking the right steps to unlock the speedups (see e.g. the requirements to use Tensor Cores), so it is tricky. If you are interested in ops supported by official NVIDIA libs, testing through those might be a good way, and here (to my mind at least) CUTLASS is the thing that sticks out for being open source and very much at the cutting edge.
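To make the benchmarking idea a bit more concrete, here is a minimal sketch of how one might compare bf16 against fp32 matmul throughput from PyTorch. The function names (`relative_speedup`, `time_matmul`, `compare_bf16_to_fp32`), the matrix size and the iteration count are all made up for illustration, and a large speed-up would only be a hint that Tensor Cores are being used, not proof of native support.

```python
import time


def relative_speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def time_matmul(dtype, n=4096, iters=20):
    """Average seconds per n x n matmul on the current CUDA device."""
    import torch  # imported here so the helper above works without torch

    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    # Warm-up so one-time setup costs are not part of the measurement.
    for _ in range(3):
        a @ b
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    # CUDA kernels run asynchronously; wait for them before stopping the clock.
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def compare_bf16_to_fp32():
    """Speed-up of bf16 matmul over plain fp32 matmul (needs a CUDA GPU)."""
    import torch

    # Assumption: we want a non-Tensor-Core fp32 baseline, so TF32 is disabled.
    torch.backends.cuda.matmul.allow_tf32 = False
    fp32_seconds = time_matmul(torch.float32)
    bf16_seconds = time_matmul(torch.bfloat16)
    return relative_speedup(fp32_seconds, bf16_seconds)
```

Calling `compare_bf16_to_fp32()` on the card in question and comparing the ratio to the one the whitepaper promises would be the "benchmark versus expectations" step. The result depends on matrix shapes, clocks and library versions, so take any single number with a grain of salt.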

Best regards



This approach sounds doable, but complicated. I was hoping pytorch could provide users an API to query that, e.g. like it does with the compute capability of the gpu card, but I guess if CUDA doesn’t provide that API, then pytorch can’t either.
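For what it’s worth, the compute capability query can at least serve as a rough heuristic. The sketch below assumes that bf16 Tensor Core instructions arrived with the Ampere generation (compute capability 8.0 and up), which matches the whitepaper discussed above but says nothing about whether a particular card has the feature throttled. The helper names are hypothetical. Recent PyTorch releases also expose `torch.cuda.is_bf16_supported()`; what exactly it checks depends on the version, so the code treats it as optional.

```python
def capability_suggests_native_bf16(capability):
    """Heuristic only: True if the (major, minor) compute capability is 8.0+.

    Assumption: bf16 Tensor Core support starts with Ampere (sm_80).
    """
    major, _minor = capability
    return major >= 8


def describe_current_gpu():
    """Print what PyTorch reports for the current CUDA device."""
    import torch  # imported here so the heuristic above works without torch

    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return

    capability = torch.cuda.get_device_capability()
    print("device:", torch.cuda.get_device_name())
    print("compute capability:", capability)
    print("heuristic says native bf16:", capability_suggests_native_bf16(capability))

    # Only available in newer PyTorch versions, so look it up defensively.
    is_bf16_supported = getattr(torch.cuda, "is_bf16_supported", None)
    if is_bf16_supported is not None:
        print("torch.cuda.is_bf16_supported():", is_bf16_supported())
```

Running `describe_current_gpu()` on an rtx-3090 should report compute capability (8, 6). That only tells you the architecture generation, so it complements rather than replaces the kernel inspection and benchmarking suggested earlier in the thread.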

I found this compilation,

that also includes a spreadsheet with performance numbers for the different operations. It covers the rtx-30* series too:

So it has TF32 numbers for Ampere cards but not bf16 yet.