I trained a semantic segmentation model (FCN). Now I am writing an API where I can call the model and get back the segmentation result as an array. Inference on the GPU is really fast (0.02 seconds). The problem is bringing the output back to the CPU: that takes 3.8 seconds. My image resolution is quite high (2048x1536), but even at a smaller resolution (250x1024, obtained by cropping the input images) the transfer still takes 0.45 seconds. Is there a way to reduce this time, or some other way to improve performance?
import time

import torch

# Inference
start = time.time()
with torch.no_grad():
    output = eval_model(input_batch)['out'][0]
torch.cuda.synchronize()  # wait for the GPU kernels to finish before stopping the timer
end = time.time()

# Tensor to CPU
start2 = time.time()
output_prediction = output.argmax(0).cpu().byte().numpy()
torch.cuda.synchronize()
end2 = time.time()

print(end - start)    # inference time
print(end2 - start2)  # transfer time
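For reference, here is a minimal sketch of one variant I could try: reduce on the GPU first (argmax shrinks the float32 logits to a uint8 class map before the transfer) and copy into pinned host memory with `non_blocking=True`. The `predict` wrapper name is mine; `eval_model` and `input_batch` are assumed to be the same objects as above, and I have not verified how much this helps in practice.

```python
import torch

def predict(eval_model, input_batch):
    """Run the model and return the class map as a uint8 NumPy array."""
    with torch.no_grad():
        out = eval_model(input_batch)['out'][0]
        # Reduce on the GPU first: argmax shrinks (num_classes, H, W)
        # float32 logits to an (H, W) uint8 map before the device-to-host copy.
        pred = out.argmax(0).byte()
    # Copy into pinned (page-locked) host memory; non_blocking=True lets the
    # copy overlap with other CUDA work until we synchronize.
    host = torch.empty(pred.shape, dtype=torch.uint8,
                       pin_memory=torch.cuda.is_available())
    host.copy_(pred, non_blocking=True)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure the copy finished before reading
    return host.numpy()
```

This only transfers one byte per pixel instead of `num_classes` floats, which should shrink the copied data by a large factor on its own.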