Why is libtorch CUDA inference so slow?

Hello all!
I have written a program to test a '.pt' module,
but I found that a single inference takes about 360 ms. The core code:

    // Time one forward pass, including the wait for the GPU to finish.
    auto startTime = std::chrono::high_resolution_clock::now();
    at::Tensor result = m_module.forward({tensor}).toTensor();
    // CUDA kernels are launched asynchronously, so synchronize the current
    // stream before reading the end time.
    at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
    AT_CUDA_CHECK(cudaStreamSynchronize(stream));
    auto endTime = std::chrono::high_resolution_clock::now();

    float totalTime = std::chrono::duration<float, std::milli>(endTime - startTime).count();
    printf("(%s) >>> infer cost time = %.3f ms\n", m_name.c_str(), totalTime);
    return result;

Output:

>>> infer cost time = 364.698 ms
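
The post does not say whether the 364 ms comes from the first forward call or from a later one. The first call on a CUDA device commonly includes one-time setup work, so it may help to report it separately from the following calls. Below is a minimal sketch of such a measurement. The names `TimingResult` and `timeFirstVsSteady` are hypothetical helpers invented for this example and are not part of libtorch. The sketch assumes the callable passed in blocks until the work is finished.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper type (not part of libtorch).
struct TimingResult {
    double firstCallMs;   // duration of the very first call
    double steadyMeanMs;  // mean duration of the calls after the first one
};

// Calls `run` once and times it, then calls it `iterations` more times and
// averages those. `run` must block until its work is done; for GPU inference
// that means synchronizing the CUDA stream inside `run`.
template <typename F>
TimingResult timeFirstVsSteady(F&& run, std::size_t iterations) {
    using Clock = std::chrono::steady_clock;
    auto elapsedMs = [](Clock::time_point a, Clock::time_point b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };

    // First call, timed on its own.
    auto t0 = Clock::now();
    run();
    auto t1 = Clock::now();
    TimingResult result{elapsedMs(t0, t1), 0.0};

    if (iterations == 0) {
        return result;  // nothing to average
    }

    // Remaining calls, timed together and averaged.
    auto t2 = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        run();
    }
    auto t3 = Clock::now();
    result.steadyMeanMs = elapsedMs(t2, t3) / static_cast<double>(iterations);
    return result;
}
```

With the code from the post, `run` could be a lambda that calls `m_module.forward({tensor})` and then `cudaStreamSynchronize(stream)`. If `steadyMeanMs` comes out much lower than `firstCallMs`, the 360 ms is mostly a one-time cost. If both are similar, the model itself is slow on this setup.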

My computer is an HP with an RTX 3070.
System: Ubuntu 18.04 with Linux kernel 5.4.0-99-generic
libtorch: 1.8.0+cu111
NVIDIA driver: 470.74
CUDA runtime: 11.1
CUDA driver: 11.4
GPU info:

Total amount of Global Memory:                  4051501056 bytes
Number of SMs:                                  40
Total amount of Constant Memory:                65536 bytes
Total amount of Shared Memory per block:        49152 bytes
Total number of registers available per block:  65536
Warp size:                                      32
Maximum number of threads per SM:               1536
Maximum number of threads per block:            1024
Maximum size of each dimension of a block:      1024 x 1024 x 64
Maximum size of each dimension of a grid:       2147483647 x 65535 x 65535
Maximum memory pitch:                           2147483647 bytes
Texture alignment:                              32 bytes
Clock rate:                                     1.29 GHz
Memory Clock rate:                              6001 MHz
Memory Bus Width:                               256-bit

nvidia-smi output:

Thu Feb 10 17:45:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8    16W /  N/A |    603MiB /  7959MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |

Could anyone help me?