Libtorch tensor.to(torch::kCPU) takes a lot of time

I'm spending a lot of time converting a tensor to a cv::Mat.
My algorithm itself takes only 20ms, while the conversion takes 80ms.

cv::Mat tensor2Mat(torch::Tensor &i_tensor)
{
	int height = i_tensor.size(0), width = i_tensor.size(1);
	// Move to CPU and force a contiguous layout, since cv::Mat expects
	// densely packed row-major data.
	i_tensor = i_tensor.to(torch::kCPU).contiguous();
	// The Mat only aliases the tensor's memory (no copy); the tensor must
	// outlive the returned Mat, which holds here because i_tensor is a
	// reference into the caller.
	cv::Mat o_Mat(cv::Size(width, height), CV_32F, i_tensor.data_ptr<float>());
	return o_Mat;
}

Is there any other solution?
CPU: Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz
GPU: 1080Ti
OS: Windows 10

Pushing a CUDA tensor to the CPU is a synchronizing operation in the default setup: CUDA kernels are launched asynchronously, so the copy waits for all previously queued GPU work to finish, and naive timing charges that wait to the copy itself unless you synchronize the code manually before starting the timer.
You could try the non_blocking option, but you would still need to wait until the data transfer is done before reading the result; since your code uses the CPU tensor directly afterwards, you wouldn't gain any performance improvement.