and it works on any card, whether it’s supported natively or not.
A non-pytorch way will do too. I wasn't able to find any.
What’s the cost/overhead - how does pytorch handle bf16 on gpus that don’t have native support for it?
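For reference, here is the minimal smoke test I've been using. It only demonstrates that bf16 ops *run* (pytorch falls back to a slower path on unsupported hardware), so passing it says nothing about native support - the sizes are arbitrary:

```python
import torch

# Minimal smoke test: bf16 ops run whether or not the hardware has
# native bf16 support; success here does NOT imply native execution.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device, dtype=torch.bfloat16)
y = torch.randn(256, 256, device=device, dtype=torch.bfloat16)
z = x @ y
print(z.dtype)  # torch.bfloat16
```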
e.g. I’m trying to check whether rtx-3090 supports bf16 natively. The information is inconsistent - the Ampere arch supports bf16, but some comments I found suggest that the non-high-end cards may have it disabled.
Thank you for the links, @tom, and for attempting to answer my query.
I read that paper but also read comments on reddit where users suggested some features might have been disabled. I don’t want to propagate rumours of something I don’t really know first hand, hence I’m trying to find a practical way to test whether a given gpu has native support for a specific datatype.
I think the hardware whitepaper is probably the most official documentation.
If you wanted to run experiments to verify:
you can look at the generated kernels to see if the expected instructions are in there as a first step,
you can benchmark versus expectations (either comparing to A100 relative speed-up or to some theoretical promise of relative speed-up).
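The second option can be sketched roughly like this. It is only a crude matmul timing; the matrix size and iteration count are my own choices, and what ratio counts as "native-like" depends on the card's spec sheet, so treat the numbers as a hint rather than proof:

```python
import time
import torch

def bench(dtype, n=1024, iters=5):
    """Crude average matmul time for the given dtype.

    A large bf16-over-fp32 speed-up is consistent with (but does not
    prove) native bf16 tensor-core execution; a slowdown suggests a
    fallback/emulated path.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(3):  # warm-up so lazy init doesn't skew timing
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

t_fp32 = bench(torch.float32)
t_bf16 = bench(torch.bfloat16)
print(f"fp32: {t_fp32:.5f}s  bf16: {t_bf16:.5f}s  "
      f"ratio: {t_fp32 / t_bf16:.2f}x")
```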
But both of them rely on you taking the right steps to unlock the speed-ups (see e.g. the requirements to use Tensor Cores), so it is tricky. If you are interested in ops supported by official NVIDIA libs, that might be a good way to test, and here (to my mind at least) CUTLASS sticks out for being open source and very much at the cutting edge.
This approach sounds doable, but complicated. I was hoping pytorch could provide users an API to query that, e.g. like it does with the current compute capability of the gpu card, but I guess if CUDA doesn’t provide that API, then pytorch can’t either.
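For completeness, this is the compute-capability query I meant; recent pytorch versions also ship a coarse torch.cuda.is_bf16_supported() check, though as far as I can tell it keys off the capability/CUDA version rather than probing the silicon, so it still can't distinguish native from emulated support:

```python
import torch

if torch.cuda.is_available():
    # (major, minor) compute capability; (8, 0)+ means Ampere or
    # newer, which is where bf16 instructions first appear.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")
    # Coarse usability check -- derived from the capability/CUDA
    # version, not a hardware probe of the actual bf16 units.
    print("bf16 usable:", torch.cuda.is_bf16_supported())
else:
    print("no CUDA device visible")
```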