I am running an image segmentation network in C++ for a production application. The network's output is a 1x1x640x400 tensor. The forward pass takes 3 ms on the GPU, but transferring the output from GPU to CPU takes 22 ms.
Is that because of the size of my output tensor? I read that I need to synchronize the CUDA stream, but
at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));
is not recognized in libtorch 1.7.1 with CUDA 10.2.
Is there a way to speed up the transfer?
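
To put the size question in perspective, here is a rough estimate of what the copy alone should cost. It is only a minimal sketch: it assumes the output is float32 and that the GPU-to-CPU link sustains about 10 GB/s. Both are assumptions, not measurements, and the helper names `tensor_bytes` and `transfer_ms` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by an N x C x H x W tensor with the given element size.
inline std::size_t tensor_bytes(std::size_t n, std::size_t c,
                                std::size_t h, std::size_t w,
                                std::size_t elem_size) {
    return n * c * h * w * elem_size;
}

// Ideal copy time in milliseconds at a sustained bandwidth given in GB/s
// (1 GB = 1e9 bytes). Ignores fixed per-copy latency, so it is a lower bound.
inline double transfer_ms(std::size_t bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return static_cast<double>(bytes) / (gb_per_s * 1e9) * 1e3;
}
```

Calling `transfer_ms(tensor_bytes(1, 1, 640, 400, sizeof(float)), 10.0)` gives about 0.1 ms for a 1,024,000-byte tensor. Under those assumptions, the size of the output alone does not seem to account for 22 ms.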