Debugging slow GPU operations

So I have a network implementation and two machines with identical hardware.

More precisely, the environment on both machines is:

----------------------  --------------------------------------------------------------------------------
sys.platform            linux
Python                  3.6.9 (default, Oct  8 2020, 12:12:24) [GCC 8.4.0]
numpy                   1.19.2
detectron2              0.3 @/home/roboeye-2/repos/roboeye/nova/src/detectron2
Compiler                GCC 7.5
CUDA compiler           CUDA 11.1
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.7.1 @/home/roboeye-2/.virtualenvs/nova/lib/python3.6/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0                   GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME               /usr/local/cuda
Pillow                  7.2.0
torchvision             0.8.2 @xxxx/lib/python3.6/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.3.post20210306
cv2                     4.4.0
----------------------  --------------------------------------------------------------------------------

On one machine the forward pass takes only 0.5 seconds at about 40% GPU utilization.
On the other machine, however, the same operation takes 22 seconds at 100% GPU utilization.
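A minimal sketch of how the forward time can be measured so that the asynchronous GPU work is actually counted (`model` and `inputs` here are placeholders for the actual detectron2 model and input batch, not the real variable names):

```python
import time
import torch

# Placeholder objects: substitute the real detectron2 model and batch.
torch.cuda.synchronize()           # flush any pending GPU work before timing
start = time.perf_counter()

with torch.no_grad():
    outputs = model(inputs)        # the forward pass being timed

torch.cuda.synchronize()           # wait for the forward pass to finish on the GPU
print(f"forward time: {time.perf_counter() - start:.3f} s")
```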

Since the hardware setup is the same and the code is identical, I am quite lost as to what is causing the issue.

Can anyone suggest what I could try?

Are you using the same PyTorch installation (conda binaries, pip wheels, or a source build) on both machines?
If so, are you also using the same CUDA and cuDNN versions? In case you are using cuDNN (e.g. in conv layers), are you setting torch.backends.cudnn.benchmark = True?
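Something like the following, run on both machines and compared, would show whether the software stacks actually match (a minimal sketch using only standard PyTorch introspection calls):

```python
import torch

# Quick environment check -- run this on both machines and diff the output.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))

# If the workload uses cuDNN (e.g. conv layers) with fixed input shapes,
# enabling benchmark mode lets cuDNN pick the fastest algorithm after a
# short warm-up, which can make a large difference in steady-state speed.
torch.backends.cudnn.benchmark = True
```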