Bfloat16 native support

I have a few questions about bfloat16

  1. How can I tell via PyTorch if the GPU it's running on supports bf16 natively? I tried:
$ python -c "import torch; print(torch.tensor(1).cuda().bfloat16().type())"
torch.cuda.BFloat16Tensor

and it works on any card, whether it’s supported natively or not.

A non-PyTorch way will do too. I wasn't able to find one.

  2. What's the cost/overhead - how does PyTorch handle bf16 on GPUs that don't have native support for it?

E.g. I'm trying to check whether the RTX 3090 supports bf16 natively. The information is inconsistent - the Ampere architecture supports bf16, but some comments I found suggest that the non-high-end cards may have it disabled.

Thank you!


The GA102 whitepaper seems to indicate that the RTX cards do support bf16 natively (in particular p23 where they also state that GA102 doesn’t have fp64 tensor core support in contrast to GA100).

So in my limited understanding, there are broadly two ways PyTorch might use the GPU's capabilities:

  • Use backend functions (like cuDNN, cuBLAS) and hopefully they use all the latest and greatest (a quick check of the library versions is sketched after this list).
  • When using intrinsics directly, conventional wisdom (see this 2017 paper discussing half vs. half2 performance) seems to say that bfloat162 will offer better performance than bfloat16 unless the compiler has learned lots of new tricks. But I am not aware of us actually using that a lot in PyTorch.
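
For the first bullet, one quick thing you can do is check which backend library versions your PyTorch build ships with. This is just a hedged sketch of that idea - treating CUDA 11+ and a recent cuDNN as "recent enough for bf16" is my assumption, not an official requirement:

import torch

# Versions of the backend libraries this PyTorch build was compiled against.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)   # assumption: 11.x+ is what you want for bf16
print("cuDNN:", torch.backends.cudnn.version())
print("Device:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))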

Again, I'm still looking into how to get the most fp16 performance, so take this with a grain of salt.

Best regards

Thomas


Thank you for the links, @tom, and for attempting to answer my query.

I read that paper but also read comments on Reddit where users suggested some features might have been disabled. I don't want to propagate rumours about something I don't really know first hand, hence I'm trying to find a practical way to test whether a given GPU has native support for a specific datatype.

I think the hardware whitepaper is probably the most official documentation.
If you wanted to run experiments to verify it:

  • You can look at the generated kernels to see if the expected instructions are in there as a first step,
  • You can benchmark against expectations (either comparing the relative speed-up to an A100's or to some theoretical promise of relative speed-up), as in the rough timing sketch further below.

But both of these rely on you taking the right steps to unlock the speedups (see e.g. the requirements for using Tensor Cores), so it is tricky. If you are interested in ops supported by official NVIDIA libs, testing those might be a good way to go, and here (to my mind at least) CUTLASS sticks out for being open source and very much at the cutting edge.
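
To make the second bullet concrete, here is a rough timing sketch of what such a benchmark could look like (my own example, not a rigorous benchmark - the matrix size, iteration count, and the expectation that bf16 matmuls clearly beat fp32 on cards with native support are all assumptions):

import time
import torch

def time_matmul(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

t_fp32 = time_matmul(torch.float32)
t_bf16 = time_matmul(torch.bfloat16)
# On a card with native bf16 (and Tensor Cores actually being used),
# the speed-up should be well above 1x.
print(f"fp32: {t_fp32 * 1e3:.2f} ms  bf16: {t_bf16 * 1e3:.2f} ms  speed-up: {t_fp32 / t_bf16:.1f}x")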

Best regards

Thomas


This approach sounds doable, but complicated. I was hoping PyTorch could provide users with an API to query that, e.g. like it does with the compute capability of the GPU card, but I guess if CUDA doesn't provide such an API, then PyTorch can't either.
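
For reference, the compute capability query mentioned above looks like this; treating major version >= 8 (Ampere) as a proxy for native bf16 is my own assumption, not something PyTorch or CUDA guarantees:

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
# Assumption: compute capability 8.0+ (Ampere) implies native bf16 support.
print("Likely native bf16:", major >= 8)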

I found this compilation,

https://machine-learning-note.readthedocs.io/en/latest/blog/cheatsheets/Nvidia_GPU.html

that also includes a spreadsheet with performance numbers for the different operations, and it covers the RTX 30* series too. It has TF32 numbers for the Ampere cards but not bf16 yet.

The last link was probably updated recently. It states that the Nvidia 30* series (Ampere) does support bfloat16.

You can also find the same information on Wikipedia.

Thanks for the update, Albert.

A lot of water has passed under the bridge since this discussion. We have integrated bf16 support in HF Transformers and even have benchmarks showing its performance on the RTX 3090 and A100.

There is an API now - I just found out about it today:

torch.cuda.is_bf16_supported()
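
A minimal usage sketch - picking a dtype based on the query (the fp16 fallback is just an example choice, not a recommendation):

import torch

# torch.cuda.is_bf16_supported() returns True if the current CUDA device can use bf16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print("Using", dtype)
x = torch.randn(8, 8, device="cuda", dtype=dtype)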

Thank you, Less

It was indeed added 8 months ago:

but since it isn't documented, it's hard to tell whether it's meant to be public or not.

At HF Transformers, meanwhile, we just copied it and extended it to our needs (a rough illustration of such an extended check follows the link below).

transformers/import_utils.py at d91841315aab55cf1347f4eb59332858525fad0f · huggingface/transformers · GitHub
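
Roughly along these lines - an illustrative sketch only, not the actual Transformers code; the compute capability 8.0 fallback threshold is an assumption:

import torch

def bf16_gpu_available():
    # Illustrative only - not the real implementation linked above.
    if not torch.cuda.is_available():
        return False
    # Older torch releases don't have torch.cuda.is_bf16_supported() yet.
    if hasattr(torch.cuda, "is_bf16_supported"):
        return torch.cuda.is_bf16_supported()
    # Fallback assumption: compute capability 8.0+ (Ampere) means native bf16.
    return torch.cuda.get_device_capability(0)[0] >= 8

print(bf16_gpu_available())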


I know this thread is quite old, but I have a similar confusion.

I am able to run

model.to(torch.bfloat16)

on older GPUs like the V100 and T4, and it does not complain, even though bfloat16 is only supposed to be supported on Ampere and above. The memory consumption is in line with half precision, so what does it mean when the docs say bfloat16 is only supported on Ampere and above?

How does it work? Does it mean on older GPUs the kernels copy tensors to fp32 for each operation?

Can I then also train (albeit slowly) in bfloat16 on older GPUs?

Any update on this? I have the same confusion. I seem to be able to load models from transformers using torch.bfloat16 and run inference with them on T4 GPUs with no problem. I also checked some of the underlying torch tensors, and they at least report dtype=torch.bfloat16.

Even if there is no native hardware support, with recent CUDA you get "emulated" bf16 support where the internal computations are done in fp32.
Regarding speed, you have offsetting effects (a quick numerical check is sketched after the list):

  • generally, fallback and emulation is slow,
  • optimized kernels (e.g. from cuBLAS) are possibly missing,
  • you might still get an advantage from reduced memory transfer from main to local/cache memory.
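
A quick way to see the emulation in action is to run a bf16 matmul on a pre-Ampere card and compare it to an fp32 reference (my own sketch; the expectation of only a few percent relative error just reflects bf16's ~8-bit mantissa):

import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

ref = a @ b                                   # fp32 reference
out = (a.bfloat16() @ b.bfloat16()).float()   # runs even without native bf16

rel_err = ((out - ref).abs().max() / ref.abs().max()).item()
print(f"max relative error: {rel_err:.3e}")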

Best regards

Thomas


Ahh okay, that makes a lot of sense. Thanks for the info!