Bfloat16 native support

I have a few questions about bfloat16

  1. How can I tell via PyTorch if the GPU it's running on supports bf16 natively? I tried:
$ python -c "import torch; print(torch.tensor(1).cuda().bfloat16().type())"
torch.cuda.BFloat16Tensor

and it works on any card, whether it’s supported natively or not.

A non-PyTorch way will do too. I wasn't able to find one.

  2. What's the cost/overhead - how does PyTorch handle bf16 on GPUs that don't have native support for it?

E.g. I'm trying to check whether the RTX 3090 supports bf16 natively. The information is inconsistent - the Ampere architecture supports bf16, but some comments I found suggest that the non-high-end cards may have it disabled.

Thank you!


The GA102 whitepaper seems to indicate that the RTX cards do support bf16 natively (in particular p23 where they also state that GA102 doesn’t have fp64 tensor core support in contrast to GA100).

So in my limited understanding, there are broadly two ways PyTorch might use the GPU's capabilities:

  • Use backend functions (like cuDNN, cuBLAS) and hopefully they use all the latest and greatest (a quick check of the library versions is sketched after this list).
  • When using intrinsics directly, conventional wisdom (see this 2017 paper discussing half vs. half2 performance) seems to say that bfloat162 will offer better performance than bfloat16 unless the compiler has learned lots of new tricks. But I am not aware of us actually using that a lot in PyTorch.
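
For the first bullet, one quick thing you can do is check which backend library versions your PyTorch build ships with. This is just a hedged sketch of that idea - treating CUDA 11+ and a recent cuDNN as "recent enough for bf16" is my assumption, not an official requirement:

import torch

# Versions of the backend libraries this PyTorch build was compiled against.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)   # assumption: 11.x+ is what you want for bf16
print("cuDNN:", torch.backends.cudnn.version())
print("Device:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))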

Again, I'm still looking into how to get the most fp16 performance, so take this with a grain of salt.

Best regards

Thomas


Thank you for the links, @tom, and for attempting to answer my query.

I read that paper but also read comments on Reddit where users suggested some features might have been disabled. I don't want to propagate rumours about something I don't really know first hand, hence I'm trying to find a practical way to test whether a given GPU has native support for a specific datatype.

I think the hardware whitepaper is probably the most official documentation.
If you wanted to run experiments to verify it:

  • You can look at the generated kernels to see if the expected instructions are in there as a first step,
  • You can benchmark against expectations (either comparing the relative speed-up to an A100's or to some theoretical promise of relative speed-up), as in the rough timing sketch further below.

But both of these rely on you taking the right steps to unlock the speedups (see e.g. the requirements for using Tensor Cores), so it is tricky. If you are interested in ops supported by official NVIDIA libs, testing those might be a good way to go, and here (to my mind at least) CUTLASS sticks out for being open source and very much at the cutting edge.
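
To make the second bullet concrete, here is a rough timing sketch of what such a benchmark could look like (my own example, not a rigorous benchmark - the matrix size, iteration count, and the expectation that bf16 matmuls clearly beat fp32 on cards with native support are all assumptions):

import time
import torch

def time_matmul(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

t_fp32 = time_matmul(torch.float32)
t_bf16 = time_matmul(torch.bfloat16)
# On a card with native bf16 (and Tensor Cores actually being used),
# the speed-up should be well above 1x.
print(f"fp32: {t_fp32 * 1e3:.2f} ms  bf16: {t_bf16 * 1e3:.2f} ms  speed-up: {t_fp32 / t_bf16:.1f}x")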

Best regards

Thomas


This approach sounds doable, but complicated. I was hoping PyTorch could provide users with an API to query that, e.g. like it does with the compute capability of the GPU card, but I guess if CUDA doesn't provide such an API, then PyTorch can't either.
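
For reference, the compute capability query mentioned above looks like this; treating major version >= 8 (Ampere) as a proxy for native bf16 is my own assumption, not something PyTorch or CUDA guarantees:

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
# Assumption: compute capability 8.0+ (Ampere) implies native bf16 support.
print("Likely native bf16:", major >= 8)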

I found this compilation,

https://machine-learning-note.readthedocs.io/en/latest/blog/cheatsheets/Nvidia_GPU.html

that also includes a spreadsheet with performance numbers for the different operations, and it covers the RTX 30* series too. It has TF32 numbers for the Ampere cards but not bf16 yet.

The last link was probably updated recently. It states that the Nvidia 30* series (Ampere) does support bfloat16.

You can also find the same information on Wikipedia.

Thanks for the update, Albert.

A lot of water has passed under the bridge since this discussion. We have integrated bf16 support in HF Transformers and even have benchmarks showing its performance on the RTX 3090 and A100.

There is an API now - I just found out about it today:

torch.cuda.is_bf16_supported()
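
A minimal usage sketch - picking a dtype based on the query (the fp16 fallback is just an example choice, not a recommendation):

import torch

# torch.cuda.is_bf16_supported() returns True if the current CUDA device can use bf16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print("Using", dtype)
x = torch.randn(8, 8, device="cuda", dtype=dtype)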

Thank you, Less

It was indeed added 8 months ago:

but since it isn't documented, it's hard to tell whether it's meant to be public or not.

At HF Transformers, meanwhile, we just copied it and extended it to our needs (a rough illustration of such an extended check follows the link below).

transformers/import_utils.py at d91841315aab55cf1347f4eb59332858525fad0f · huggingface/transformers · GitHub
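
Roughly along these lines - an illustrative sketch only, not the actual Transformers code; the compute capability 8.0 fallback threshold is an assumption:

import torch

def bf16_gpu_available():
    # Illustrative only - not the real implementation linked above.
    if not torch.cuda.is_available():
        return False
    # Older torch releases don't have torch.cuda.is_bf16_supported() yet.
    if hasattr(torch.cuda, "is_bf16_supported"):
        return torch.cuda.is_bf16_supported()
    # Fallback assumption: compute capability 8.0+ (Ampere) means native bf16.
    return torch.cuda.get_device_capability(0)[0] >= 8

print(bf16_gpu_available())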


I know this thread is quite old, but I have a similar confusion.

I am able to run

model.to(torch.bfloat16)

on older GPUs like the V100 and T4, and it does not complain, even though bfloat16 is only supposed to be supported on Ampere and above. The memory consumption is in line with half precision, so what does it mean when the docs say bfloat16 is only supported on Ampere and above?

How does it work? Does it mean on older GPUs the kernels copy tensors to fp32 for each operation?

Can I then also train (albeit slowly) in bfloat16 on older GPUs?

Any update on this? I have the same confusion. I seem to be able to load models from transformers using torch.bfloat16 and run inference with them on T4 GPUs with no problem. I also checked some of the underlying torch tensors, and they at least report dtype=torch.bfloat16.

Even if there is no native hardware support, with recent CUDA you get "emulated" bf16 support where the internal computations are done in fp32.
Regarding speed, you have offsetting effects (a quick numerical check is sketched after the list):

  • generally, fallback and emulation is slow,
  • optimized kernels (e.g. from cuBLAS) are possibly missing,
  • you might still get an advantage from reduced memory transfer from main to local/cache memory.
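
A quick way to see the emulation in action is to run a bf16 matmul on a pre-Ampere card and compare it to an fp32 reference (my own sketch; the expectation of only a few percent relative error just reflects bf16's ~8-bit mantissa):

import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

ref = a @ b                                   # fp32 reference
out = (a.bfloat16() @ b.bfloat16()).float()   # runs even without native bf16

rel_err = ((out - ref).abs().max() / ref.abs().max()).item()
print(f"max relative error: {rel_err:.3e}")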

Best regards

Thomas


Ahh okay, that makes a lot of sense. Thanks for the info!