Low inference speed with libtorch on Win10

I want to deploy a PyTorch YOLO model in a Windows C++ environment.
The setup is libtorch 1.4 (debug build) + VS2017 + CUDA 10.1.
The C++ inference code runs successfully, but it takes about 0.16 s per
input image; for comparison, the same model takes about 0.03 s in Python.
I also noticed that GPU utilization is rather low. Any ideas on how to solve this?

How did you profile the libtorch and Python code?
Note that CUDA operations are asynchronous, so you would need to synchronize the code before starting and stopping the timer.
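For reference, a minimal sketch of a synchronized timing helper in Python (names like `timed_inference` are hypothetical, not from the original post); the same pattern applies in libtorch C++ by synchronizing the CUDA device around the timed region:

```python
import time
import torch

def timed_inference(model, inp):
    # Flush any previously queued CUDA work so it doesn't leak into the measurement
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(inp)
    # Synchronize again so the timer covers the full GPU execution,
    # not just the asynchronous kernel launch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start
```

Without the second synchronize, the timer often stops right after the kernels are launched, which makes the GPU path look unrealistically fast.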

Plus: why do you use the debug version for benchmarking? The optimization switches are turned off in debug builds, which alone can explain a large slowdown.
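If the project is CMake-based, switching to a Release build might look roughly like this (the libtorch path is a placeholder; you would also need to download the Release distribution of libtorch rather than the debug one):

```shell
# Configure against the Release libtorch distribution and build with optimizations
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
cmake --build . --config Release
```

In Visual Studio itself, the equivalent is selecting the Release configuration and linking against the Release libtorch libraries.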