Can't get any inference speedup using FP16 on a 2080 Ti compared with FP32

I converted a model trained in FP32, using `model.half()` and `input.half()` to run inference in FP16 precision, but the inference speed is almost the same as FP32 on a 2080 Ti. My batch size is fixed to 1. Why is that, and how can I get a speedup from FP16? Thanks.
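A minimal sketch of the conversion described above, using a small stand-in CNN since the actual model isn't given. Note that FP16 kernels only pay off on the GPU (on CPU many half-precision ops are unsupported or slow), and timing CUDA code needs explicit synchronization because kernel launches are asynchronous:

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in for the trained FP32 model.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
).eval()
x = torch.randn(1, 3, 224, 224)

# Convert weights and input to FP16, as in the question.
model_fp16 = model.half()
x_fp16 = x.half()

# Parameters and input are now float16.
print(next(model_fp16.parameters()).dtype)  # torch.float16
print(x_fp16.dtype)                         # torch.float16

# Run and time the FP16 forward pass on the GPU, synchronizing
# before reading the clock so async kernel launches are counted.
if torch.cuda.is_available():
    model_fp16, x_fp16 = model_fp16.cuda(), x_fp16.cuda()
    with torch.no_grad():
        model_fp16(x_fp16)  # warm-up
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(100):
            model_fp16(x_fp16)
        torch.cuda.synchronize()
        print(f"FP16: {(time.time() - t0) * 10:.2f} ms/iter")
```

With batch size 1 the per-iteration cost is often dominated by kernel-launch and framework overhead rather than arithmetic, which is one plausible reason FP16 and FP32 timings look similar; the sketch above at least verifies the conversion itself took effect.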

Answered here.