PyTorch fp16 inference

The model was trained in fp32. I use .half() to convert the layers and inputs to fp16. It does accelerate inference, but the speedup is far less than 2x over fp32. My platform is an NVIDIA TX2 with compute capability 6.2, which supports fp16 well. So my question is: can fp16 not be twice as fast as fp32 in PyTorch? Looking forward to your reply. Thank you.

The performance gains depend on the operations, shapes, and other potential bottlenecks.
E.g. convolutions in cudnn7.2 and earlier needed input channels, output channels, and batch size to be multiples of 8 to be able to use TensorCores. This restriction was lifted in cudnn7.3 and later.

GEMMs, however, should still use matrices whose dimensions are multiples of 8 to use TensorCores in cublas and cudnn.
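As a rough illustration of the multiple-of-8 rule, the sketch below checks whether a shape is aligned and computes the padded size for a dimension that is not. The helper names `round_up` and `is_aligned` are hypothetical and not part of PyTorch; whether padding actually pays off depends on your layers and hardware.

```python
def round_up(n: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


def is_aligned(shape, multiple: int = 8) -> bool:
    """True if every dimension in `shape` is divisible by `multiple`."""
    return all(dim % multiple == 0 for dim in shape)


# Example: a linear layer with 1000 inputs and 1001 outputs (made-up sizes).
in_features, out_features = 1000, 1001
print(is_aligned((in_features, out_features)))  # the 1001 dimension is not aligned
print(round_up(out_features))                   # candidate padded size
```

For a first convolution with 3 input channels, for example, `round_up(3)` gives 8, so the input would need zero-padded channels to satisfy the rule.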

Are you preloading the data or are you using e.g. a DataLoader?
Note that you might encounter new bottlenecks in your code if the model was previously the bottleneck and has now been accelerated.
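One way to check whether the model itself or something around it (such as data loading) limits the speedup is to time only the forward pass on a tensor that is already on the device. The following is a minimal sketch, assuming a small made-up convolutional model and input shape; `time_forward` is a hypothetical helper, and the fp16 run is skipped when no GPU is available.

```python
import time

import torch


def time_forward(model, x, iters=20, warmup=5):
    """Average seconds per forward pass; synchronizes when running on CUDA."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        for _ in range(warmup):  # warm-up iterations are not timed
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # CUDA kernels launch asynchronously
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model with channel counts and batch size chosen as multiples of 8.
model = torch.nn.Sequential(
    torch.nn.Conv2d(8, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, kernel_size=3, padding=1),
).to(device).eval()
x = torch.randn(8, 8, 32, 32, device=device)  # preloaded input, no DataLoader

t_fp32 = time_forward(model, x)
print(f"fp32: {t_fp32 * 1e3:.3f} ms per forward pass")

if device.type == "cuda":
    # model.half() converts the parameters in place, so fp32 is timed first.
    t_fp16 = time_forward(model.half(), x.half())
    print(f"fp16: {t_fp16 * 1e3:.3f} ms per forward pass")
```

If the isolated forward pass speeds up clearly under fp16 but the end-to-end run does not, the remaining time is likely spent outside the model, e.g. in data loading or host-to-device copies.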