This approach sounds doable, but complicated. I was hoping PyTorch could provide users an API to query that, e.g. like it does with the compute capability of the GPU card, but I guess if CUDA doesn't provide that API, then PyTorch can't either.
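In the meantime, one workaround is to infer datatype support from the compute capability tuple that `torch.cuda.get_device_capability()` already returns. A minimal sketch, assuming the usual architecture cutoffs (Volta sm_70 for fp16 tensor cores, Ampere sm_80 for TF32 and bf16) — this is a heuristic based on NVIDIA's published architecture generations, not an official API, and `supported_matmul_dtypes` is a hypothetical helper name:

```python
# Heuristic: map a CUDA compute capability tuple (major, minor),
# e.g. (8, 6) for an RTX 3090, to the matmul dtypes its tensor
# cores are known to accelerate. Thresholds are assumptions drawn
# from NVIDIA architecture docs, not queried from the driver.

def supported_matmul_dtypes(capability):
    major, minor = capability
    dtypes = {"fp32"}
    if major >= 7:            # Volta (sm_70) introduced fp16 tensor cores
        dtypes.add("fp16")
    if major >= 8:            # Ampere (sm_80) added TF32 and bf16 support
        dtypes |= {"tf32", "bf16"}
    return dtypes

# In a real program the tuple would come from
# torch.cuda.get_device_capability(); hard-coded here for illustration.
print(sorted(supported_matmul_dtypes((8, 6))))   # Ampere, e.g. RTX 3090
print(sorted(supported_matmul_dtypes((7, 5))))   # Turing, e.g. RTX 2080
```

This obviously needs updating whenever a new architecture ships, which is exactly why a built-in query API would be nicer.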
I found this compilation:
https://machine-learning-note.readthedocs.io/en/latest/blog/cheatsheets/Nvidia_GPU.html
It also includes a spreadsheet with performance numbers for the different operations, and it covers the RTX 30* series too: it has TF32 numbers for the Ampere cards, but no bf16 numbers yet.