(The third column shows the time taken, in μs.)
The three images are from three slightly different code snippets. In every case, there is always one line that takes 60 ms.
I would like to know why this happens and how to reduce that time.
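All “slow” lines contain a cpu() call, which forces a synchronization if your script runs on the GPU. CUDA operations are launched asynchronously, so the time of the GPU work queued earlier is accumulated in the line that has to wait for it, not in the lines that launched it.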
To properly time CUDA code, you should synchronize before starting and stopping the timer (if you are manually profiling).
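As a rough illustration of the manual timing described above, here is a minimal sketch that synchronizes before starting and before stopping the timer. The helper name `timed` and the matmul workload are made up for this example and are not taken from your code; the sketch also falls back to the CPU when no GPU is available, in which case no synchronization is needed.

```python
import time

import torch


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds).

    Hypothetical helper for illustration: it synchronizes before starting
    and before stopping the timer so that queued GPU work is not
    attributed to the wrong line.
    """
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # wait for previously queued GPU work
    start = time.perf_counter()
    result = fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the kernels launched by fn
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example workload (an assumption, chosen only to have something to time).
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
y, seconds = timed(torch.matmul, x, x)
print(f"matmul: {seconds * 1e6:.0f} μs")
```

With this pattern the measured time should belong to the operation itself, and a later cpu() call should no longer appear to absorb the cost of earlier kernels. The first CUDA call in a process may still include one-time startup overhead, so timing a few warm-up iterations first is usually a good idea.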