CUDA tensor to CPU is very time-consuming

(Ismymajia) #1

Copying a CPU tensor to CUDA is very fast, e.g. a 500x300 image from CPU to a CUDA tensor costs 2 ms,
but the same 500x300 image from CUDA back to CPU costs 100 ms.

That is:

auto cuda_tensor = torch::autograd::make_variable(batchInputs, false).to(torch::kCUDA); // 2 ms
auto output = module->forward(cuda_tensor).toTensor(); // 18 ms
auto output1 = torch::squeeze(output); // 0 ms
auto output2 = torch::exp(output1); // 0 ms
auto image_cpu = output2.cpu(); // 100 ms

output2.cpu() costs 100 ms. How can this be optimized?

(Alban D) #2

Keep in mind that the CUDA API is asynchronous: it only synchronizes if you ask for it explicitly with torch.cuda.synchronize(), or if you read a value back to the CPU (by calling .cpu(), for example).
So the timing you measure here for the line that contains .cpu() is actually the time of all the preceding GPU computations plus the time to send the tensor back.