PyTorch 1.5+ is slower than PyTorch 1.3

Hi, I use SOLOv2 for instance segmentation, and I found that inference with PyTorch 1.5+ is about 3 times slower than with PyTorch 1.3.

The code is almost the same in both cases; the only change is replacing AT_CHECK with TORCH_CHECK for PyTorch 1.5+.

The base Docker images are pytorch/pytorch:1.3-cuda10.1-cudnn7-devel, pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel, and 1.6.0-cuda10.1-cudnn7-devel.
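
A speed comparison like this depends on how the timing is done, because CUDA kernels are launched asynchronously and the clock can stop before the GPU has finished. Below is a minimal sketch of a timing helper that could be run unchanged in each image. It is not the exact script behind the numbers above; `benchmark`, `warmup`, `iters`, and `sync` are illustrative names, and `model` and `images` in the commented usage are placeholders for the SOLOv2 model and an input batch.

```python
import time
from statistics import median


def benchmark(fn, sync=None, warmup=10, iters=50):
    """Return the median wall-clock time of one call to fn, in seconds.

    sync is an optional zero-argument callable that blocks until queued
    device work has finished (for CUDA, torch.cuda.synchronize).
    """
    if sync is None:
        sync = lambda: None  # CPU-only case: nothing to wait for

    # Warm-up runs are excluded from timing (first calls include
    # one-time setup such as memory allocation and kernel selection).
    for _ in range(warmup):
        fn()
    sync()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for the device before stopping the clock
        times.append(time.perf_counter() - start)
    return median(times)


# Hypothetical usage with PyTorch (model and images are placeholders):
#
#   import torch
#   model.eval()
#   with torch.no_grad():
#       t = benchmark(lambda: model(images), sync=torch.cuda.synchronize)
#   print(f"{t * 1000:.1f} ms per batch on torch {torch.__version__}")
```

Reporting the median rather than the mean keeps a few slow outlier iterations from dominating the result, which makes the per-version numbers easier to compare.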