Slow GPU to CPU transfer

I am working on a research project comparing semantic segmentation and object detection algorithms in terms of speed. The semantic segmentation model I am using runs about 10 fps faster than the object detection model. But when I add code to move the output tensor from the GPU to the CPU, the segmentation model gets slower (probably because the tensor is large), and the difference shrinks to 2 fps.

prediction = prediction.cpu()

prediction is a 1*2*1920*1080 tensor.
I want to find the contours of prediction once it is converted to a NumPy array. Is there any way to speed this up, either by making the conversion faster or by running the contour detection on the GPU (though I couldn't find a GPU implementation)?
Any resources or ideas that could help in this situation would be much appreciated.

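Transferring a tensor from the GPU to the CPU synchronizes the code, which can reduce the overall performance of your script.
The transfer itself might not take much time, but the host has to wait until the queued GPU work is finished before the result can be copied back to the CPU. Because CUDA operations are executed asynchronously, a timing taken without the transfer may only measure the kernel launches, so the .cpu() call can end up absorbing the model's actual compute time.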
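
To check how much of the slowdown is the copy itself and how much is the wait for the GPU, you could synchronize explicitly before each timestamp. The snippet below is only a rough sketch: the function name timed_forward_and_transfer, the small Conv2d stand-in model and the 64*64 input are assumptions made for illustration, not your actual setup. It falls back to the CPU when no GPU is available, in which case the synchronization is simply skipped.

```python
import time

import torch


def timed_forward_and_transfer(model, batch):
    """Return (prediction_on_cpu, compute_seconds, transfer_seconds).

    Hypothetical helper for illustration only.
    """
    use_cuda = batch.is_cuda
    with torch.no_grad():
        if use_cuda:
            # Make sure earlier queued GPU work does not leak into the timing.
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        prediction = model(batch)
        if use_cuda:
            # Wait for the forward pass to really finish before stopping the clock.
            torch.cuda.synchronize()
        t1 = time.perf_counter()
        # Only the device-to-host copy is measured between t1 and t2.
        prediction_cpu = prediction.cpu()
        t2 = time.perf_counter()
    return prediction_cpu, t1 - t0, t2 - t1


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in model with 2 output channels; replace with the real segmentation model.
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1).to(device).eval()
# Small input to keep the sketch quick; the tensor in question is 1*2*1920*1080.
batch = torch.randn(1, 3, 64, 64, device=device)

prediction_cpu, compute_s, transfer_s = timed_forward_and_transfer(model, batch)
print(f"forward: {compute_s * 1e3:.3f} ms, transfer: {transfer_s * 1e3:.3f} ms")
```

If the forward time measured this way is much larger than the transfer time, the fps drop you saw was most likely the model's real runtime becoming visible, not the cost of the copy. On a GPU, the first call also includes one-time startup overhead, so it is worth running a few warm-up iterations before trusting the numbers.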