Non-blocking copy from CPU to GPU takes a long time

Hi guys,

I have been trying to copy a tensor from CPU to GPU with non_blocking set, following the tutorial, by doing the following:

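#include <torch/torch.h>
#include <chrono>
#include <iostream>

using namespace std::chrono;

int main() {
  auto t0 = high_resolution_clock::now();
  torch::Tensor cpu_tensor = torch::randn({128, 3, 24, 24}, torch::kFloat);
  torch::Tensor gpu_tensor = cpu_tensor.to(torch::kCUDA, /*non_blocking=*/true);
  auto t1 = high_resolution_clock::now();
  // Elapsed time in seconds (microseconds / 1e6)
  std::cout << "Time Taken: " << static_cast<double>(duration_cast<microseconds>(t1 - t0).count()) / 1e6 << std::endl;
  return 0;
}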

Result: "Time Taken: 0.08"
This is very slow compared to Python, where the same copy takes about 0.0001 seconds. Can anyone tell me whether I am doing something wrong or using the API incorrectly? Thanks in advance for your answers!

Don't time the first call that goes through the CUDA allocator. The first time the CUDA allocator is called, it can be slow. If you want to measure this, run the to() function several times first and then measure its running time. That will be more accurate.
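
A rough sketch of what that could look like, in case it helps. The helper name `mean_seconds_after_warmup` and the warm-up/iteration counts are made up for illustration and are not part of libtorch. It runs the callable a few times untimed, then averages the timed runs. The libtorch usage in the trailing comment assumes a CUDA build and that `torch::cuda::synchronize()` is available in your version; the synchronize call is my own addition so the timer waits for the copy to finish.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of libtorch): calls `fn` `warmup` times
// without timing, then returns the mean wall-clock seconds per call
// over `iters` timed calls.
template <typename F>
double mean_seconds_after_warmup(F&& fn, int warmup, int iters) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) {
        fn();  // untimed: absorbs one-time setup such as allocator init
    }
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        fn();
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iters;
}

// Possible usage with libtorch (assumes <torch/torch.h> and a CUDA build):
//   double s = mean_seconds_after_warmup([&] {
//       auto gpu_tensor = cpu_tensor.to(torch::kCUDA, /*non_blocking=*/true);
//       torch::cuda::synchronize();  // wait for the copy before stopping the clock
//   }, /*warmup=*/10, /*iters=*/100);
```

Creating `cpu_tensor` outside the lambda keeps the cost of `torch::randn` out of the measurement, which the original snippet included.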

In addition to what @glaringlee said, the CPU tensors must also be in pinned memory for the copy to be non-blocking.

torch::Tensor cpu_tensor = torch::randn({128, 3, 24, 24}, torch::kFloat).pin_memory();

Thanks, glaringlee, I will try that.