How to speed up image segmentation results transfer from GPU to CPU

I am running an image segmentation network in C++ in a production application. The output of my network is a 1x1x640x400 tensor. My forward pass takes 3 ms on the GPU, but the problem is that the transfer of the result from GPU to CPU takes 22 ms.

Is that because of the size of my output tensor? I saw that I need to synchronize the CUDA stream, but

at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));

is not recognized in libtorch 1.7.1 with CUDA 10.2.
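
For reference, here is a minimal sketch of what I would expect the synchronization to look like once the right headers are pulled in; the header names and the cudaDeviceSynchronize() fallback are my assumption, not something I verified against this exact libtorch version:

#include <cuda_runtime.h>           // cudaStreamSynchronize / cudaDeviceSynchronize
#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream (assumed header for libtorch 1.7.1)
#include <ATen/cuda/Exceptions.h>   // AT_CUDA_CHECK (assumed header)

void sync_current_stream() {
    // Block the host until all work queued on libtorch's current CUDA stream has finished.
    at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
    AT_CUDA_CHECK(cudaStreamSynchronize(stream.stream()));

    // Coarser fallback that only needs the CUDA runtime header:
    // AT_CUDA_CHECK(cudaDeviceSynchronize());
}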

Is there a way to speed up the transfer?

The transfer time is determined by factors such as the bandwidth of the connection between the GPU and the host.
For your particular issue, make sure you are timing the operations correctly. Since CUDA operations are executed asynchronously, you have to synchronize before starting and before stopping the timer.
If you don't synchronize, you might be profiling only the kernel launch (not the actual forward pass), while the transfer to the host would implicitly synchronize and thus accumulate the forward pass time plus the transfer time.
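
As a rough sketch (not something I ran against 1.7.1), this is what synchronized timing could look like; here the device-to-host copy of a dummy 1x1x640x400 tensor is measured, and the same pattern applies when you time the forward pass:

#include <torch/torch.h>
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

int main() {
    // Stand-in for the network output: a 1x1x640x400 tensor already on the GPU.
    torch::Tensor output = torch::rand({1, 1, 640, 400}, torch::kCUDA);

    // Wait for all previously launched kernels before starting the timer,
    // otherwise their runtime gets folded into the measured transfer.
    cudaDeviceSynchronize();
    auto start = std::chrono::high_resolution_clock::now();

    // Device-to-host copy of the result.
    torch::Tensor output_cpu = output.to(torch::kCPU);

    // The copy blocks until the data is on the host; synchronizing again keeps the
    // pattern identical when you time the forward pass on its own.
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "transfer: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;
    return 0;
}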