PyTorch fp16 slower than fp32 in custom CUDA extension

I’m currently developing a new layer type as a PyTorch 1.1 CUDA extension. I followed the official tutorial and used the macro AT_DISPATCH_FLOATING_TYPES_AND_HALF to generate support for fp16. Other than that, my code has no special treatment for fp16; I took care to cast all floating-point constants with static_cast<T> (where T is the template scalar type).
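For context, the structure of my code is roughly like this (a simplified placeholder mirroring the 1.1-era tutorial style; `my_op` / `my_op_kernel` and the kernel body are made-up examples, not my actual layer):

```cpp
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Placeholder elementwise kernel: out = 2 * x + 1, with the constants cast
// to the template scalar type, the same way I handle constants in my real code.
template <typename scalar_t>
__global__ void my_op_kernel(const scalar_t* __restrict__ x,
                             scalar_t* __restrict__ out,
                             int64_t n) {
  const int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = static_cast<scalar_t>(2) * x[i] + static_cast<scalar_t>(1);
  }
}

torch::Tensor my_op(torch::Tensor x) {
  auto out = torch::empty_like(x);
  const int64_t n = x.numel();
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // The dispatch macro instantiates the kernel for double, float, and at::Half.
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(x.type(), "my_op", ([&] {
    my_op_kernel<scalar_t><<<blocks, threads>>>(
        x.data<scalar_t>(), out.data<scalar_t>(), n);
  }));

  return out;
}
```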

However, in all my benchmarks, fp16 is more than 30% slower than fp32 for my kernel.
fp16 should be faster in both GPU memory access and arithmetic, no?
I’m quite new to CUDA programming and might have missed something simple. Can someone please point out any likely cause?

My test hardware is a Titan RTX (Turing, 24 GB) running CUDA 10.
