Why does PyTorch run so slowly when batch_size=1?

The running time on the GPU in testing (batch_size=1) is similar to that in training (batch_size=128).
However, with the same model in TensorFlow, testing with a batch size of 1 runs faster on the GPU than training with a batch size of 128.
PyTorch is also faster than TensorFlow in training, but much slower in testing.
This confuses me 😂, because in RL I have to sample data, and sampling is very slow!
This is with PyTorch 1.0.0 and CUDA 10.

How did you time your code?
Note that CUDA calls are asynchronous, so you have to synchronize with torch.cuda.synchronize() before starting and before stopping your timer:

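torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
t0 = time.perf_counter()
# Your code
torch.cuda.synchronize()  # wait for the timed GPU work to finish
t1 = time.perf_counter()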

Also, the first CUDA call shouldn’t be timed, as it might have some overhead due to context creation.
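
Putting both points together, here is a minimal sketch of a timing helper that runs a few untimed warm-up calls and synchronizes around the timed region. The names time_call and _sync, and the defaults for warmup and iters, are made up for illustration and are not part of PyTorch. The helper falls back to a no-op sync when PyTorch or a GPU is not available, so it can also be tried on a CPU-only machine.

```python
import time

try:
    import torch
except ImportError:  # assumption: lets the sketch run without PyTorch installed
    torch = None


def _sync():
    """Wait for queued CUDA work to finish; does nothing without a GPU."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def time_call(fn, warmup=3, iters=20):
    """Return the mean wall-clock time in seconds of one call to fn().

    time_call, warmup and iters are hypothetical names chosen for this sketch.
    """
    for _ in range(warmup):
        fn()  # untimed calls absorb one-off costs such as context creation
    _sync()  # make sure the warm-up work is done before the timer starts
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    _sync()  # make sure all timed GPU work is done before the timer stops
    t1 = time.perf_counter()
    return (t1 - t0) / iters
```

With a model and an input tensor x of your own already on the GPU, you could call time_call(lambda: model(x)) once with a batch of 1 and once with a batch of 128, then compare the two averages.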