CUDA tensor to CPU is very time-consuming

Copying a CPU tensor to a CUDA tensor is very fast, e.g. a 500x300 image takes 2 ms from CPU to CUDA.
But the same 500x300 image takes 100 ms from CUDA back to CPU.

That is:
auto cuda_tensor = torch::autograd::make_variable(batchInputs, false).to(torch::kCUDA); // 2ms
auto output = module->forward({cuda_tensor}).toTensor(); // 18ms (forward takes a vector of IValues)
auto output1 = torch::squeeze(output); // 0ms
auto output2 = torch::exp(output1); // 0ms
auto image_cpu = output2.cpu(); // 100ms

output2.cpu() takes 100 ms. How can this be optimized?

Hi,

Keep in mind that the CUDA API is asynchronous and only synchronizes if you ask for it explicitly with torch.cuda.synchronize() or if you read some value back to the CPU (by calling .cpu(), for example).
So the timing you measure for the line that contains .cpu() is actually the time of all the queued computations plus the transfer back.
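
For illustration, here is a minimal C++ timing sketch under the assumptions of the snippet above (a torch::jit::script::Module named module and a GPU tensor cuda_tensor): synchronizing the stream before reading the clock attributes the kernel time to forward() instead of silently folding it into the later .cpu() call.

#include <chrono>
#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream
#include <ATen/cuda/Exceptions.h>   // AT_CUDA_CHECK

auto t0 = std::chrono::steady_clock::now();
auto output = module->forward({cuda_tensor}).toTensor();
// Wait until all queued CUDA work has finished before stopping the clock.
AT_CUDA_CHECK(cudaStreamSynchronize(at::cuda::getCurrentCUDAStream()));
auto t1 = std::chrono::steady_clock::now();  // t1 - t0: forward pass only
auto image_cpu = output.cpu();
auto t2 = std::chrono::steady_clock::now();  // t2 - t1: device-to-host copy only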

Hi, I have the same problem here. But how do I call torch.cuda.synchronize() from C++?

This should work:

#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream
#include <ATen/cuda/Exceptions.h>   // AT_CUDA_CHECK

at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));
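
Depending on your libtorch version, there may also be a torch::cuda::synchronize() convenience function (declared in torch/cuda.h) that synchronizes the whole device, which is the closest C++ equivalent of the Python call.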

Thank you very much, it works for me. But I have another question: after I added cudaStreamSynchronize to my code, the time spent in tensor.cpu() did drop. However, I found that cudaStreamSynchronize itself costs a lot of time, about 70 ms, which is almost what tensor.cpu() cost before I added the synchronization. So whether I use cudaStreamSynchronize or not, this extra time seems inevitable. Is that right? If so, is there any way to reduce it, e.g. by reducing the number of images in the input tensor?

This is expected, since the cpu() call is blocking and waits for all asynchronous CUDA operations to finish. If you don't synchronize manually, the cpu() call performs the synchronization implicitly and thus accumulates the CUDA kernel runtime.
Since you are now synchronizing manually, the runtime is spent in the kernels, and the cpu() call executes just the device-to-host copy.
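
If the device-to-host copy itself ever becomes the bottleneck (rather than the kernels you are waiting on), a common option is to copy into a pinned (page-locked) CPU tensor, which takes the faster DMA path and can run asynchronously. A minimal sketch, assuming output2 is the GPU tensor from the snippet above:

// Allocate a page-locked CPU tensor; pinned memory enables
// the faster, asynchronous DMA transfer path.
auto pinned = torch::empty(
    output2.sizes(),
    torch::TensorOptions().dtype(output2.dtype()).pinned_memory(true));
// non_blocking=true queues the copy on the current stream;
// synchronize before actually reading the values on the CPU.
pinned.copy_(output2, /*non_blocking=*/true);
AT_CUDA_CHECK(cudaStreamSynchronize(at::cuda::getCurrentCUDAStream()));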

Yes, you could reduce the batch size or the spatial size of the input images.
However, note that while reducing the workload of the CUDA operations might reduce the wall-clock time, the overall throughput might drop if you look at the time per input.

I understand. Thank you very much.