Transferring the semantic segmentation result to the CPU is slow

I trained a semantic segmentation model (FCN). Now I am writing an API where I can call the model and get the segmentation result back as an array. Inference on the GPU is really fast (0.02 seconds). The problem is bringing the output back to the CPU: it takes 3.8 seconds. My image resolution is quite high (2048x1536), but even with a smaller resolution (250x1024, cropped input images) the transfer takes 0.45 seconds. Is there a way to reduce this time, or another way to improve performance?

import time
import torch

# Inference (eval_model and input_batch are defined earlier in the script)
start = time.time()
with torch.no_grad():
    output = eval_model(input_batch)['out'][0]
torch.cuda.synchronize()  # wait for the GPU kernels to finish before stopping the timer
end = time.time()

# Tensor to CPU
start2 = time.time()
output_prediction = output.argmax(0).cpu().byte().numpy()
torch.cuda.synchronize()
end2 = time.time()

print(end - start)    # inference time
print(end2 - start2)  # transfer time
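
One idea I had is to reduce how much data crosses the bus: since only the class map is needed on the CPU, the argmax result (int64, 8 bytes per pixel) could be cast to uint8 on the GPU before the copy, so only 1 byte per pixel is transferred. A minimal sketch (the 21 output channels below are just a placeholder for the real number of classes):

import torch

# placeholder for the model output: per-class logits on the GPU
output = torch.randn(21, 2048, 1536, device='cuda')

# argmax produces int64 class indices; casting to uint8 on the GPU first
# means the subsequent copy moves 1 byte per pixel instead of 8
pred_gpu = output.argmax(0).to(torch.uint8)
output_prediction = pred_gpu.cpu().numpy()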

I get a transfer time of approx. 700 ms for an output of [1, 64, 2048, 1536] on my laptop, which shouldn't have the fastest GPU connection compared to a server or workstation, using:

import time
import torch

x = torch.randn(1, 64, 2048, 1536, device='cuda')
nb_iters = 100

# warmup
for _ in range(10):
    y = x.cpu()

torch.cuda.synchronize()  # make sure all pending work is done before starting the timer
t0 = time.time()
for _ in range(nb_iters):
    y = x.cpu()
torch.cuda.synchronize()
t1 = time.time()

print('{:.3f}ms'.format((t1 - t0) / nb_iters * 1000))
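
If the raw copy really is the bottleneck, one further option (just a sketch, not measured here) is to copy into a pre-allocated pinned (page-locked) host buffer with non_blocking=True. That avoids the staging through pageable memory and lets the copy run asynchronously with respect to the host:

import time
import torch

x = torch.randn(1, 64, 2048, 1536, device='cuda')

# page-locked host buffer, allocated once and reused across calls
host_buf = torch.empty(x.shape, dtype=x.dtype, pin_memory=True)

torch.cuda.synchronize()
t0 = time.time()
host_buf.copy_(x, non_blocking=True)  # enqueue the async device-to-host copy
torch.cuda.synchronize()              # wait for the copy before reading host_buf
t1 = time.time()

print('{:.3f}ms'.format((t1 - t0) * 1000))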

@ptrblck thank you very much. I found my mistake.