Bfloat16 native support

This approach sounds doable, but complicated. I was hoping PyTorch could provide users an API to query that, e.g. like it does with the compute capability of the GPU card, but I guess if CUDA doesn't provide that API, then PyTorch can't either.
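In the meantime, a rough workaround might be to key off the compute capability that PyTorch already exposes. Here is a minimal sketch, assuming native bf16 support begins with Ampere (compute capability 8.0); this is a heuristic, not an official API:

```python
import torch

def bf16_native_support() -> bool:
    """Heuristic check for native bfloat16 support.

    Assumption: bf16 tensor-core support starts with Ampere
    (compute capability 8.0). This is not an official PyTorch API.
    """
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8

print(bf16_native_support())
```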

I found this compilation,

https://machine-learning-note.readthedocs.io/en/latest/blog/cheatsheets/Nvidia_GPU.html

that also includes a spreadsheet with performance numbers for the different operations, and it covers the RTX 30* series as well.

It has TF32 numbers for the Ampere cards, but not bf16 yet.