How to quickly transfer data from CUDA to the CPU with libtorch?

I use a ResNet to classify images, and I find that copying the result back to the CPU is slower than forward() itself, and far slower than copying the image data to CUDA.
Does anyone know how to speed up transferring data back to the CPU?

OS: Windows 7
CUDA 9.2
libtorch 1.0 nightly (release build)
Visual Studio 2019


The relevant code is as follows:

t1 = std::chrono::steady_clock::now();

at::Tensor tmpData2 = torch::from_blob(tmp, { resNetParam.batchSize, resNetParam.roiSize, resNetParam.roiSize, resNetParam.imgDepth }, torch::kFloat).to(m_device);
tmpData2 = tmpData2.permute({ 0, 3, 1, 2 });

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000.0;
printf("putIntoCuda %.3f ms \n", time_used);

t1 = std::chrono::steady_clock::now();

torch::Tensor out = m_model->forward({ tmpData2 }).toTensor();

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000.0;
printf("forward %.3f ms \n", time_used);

std::tuple<torch::Tensor, torch::Tensor> result = out.sort(-1, true);

t1 = std::chrono::steady_clock::now();

torch::Tensor sortedScores = std::get<0>(result).to(torch::kCPU);
torch::Tensor sortedIdx = std::get<1>(result).toType(torch::kInt32).to(torch::kCPU);

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000.0;
printf("toCPU %.3f ms \n", time_used);

These measurements are quite likely off because you don't seem to synchronize between calls. CUDA operations are launched asynchronously, so the copy measured as "toCPU" has to wait for the still-pending forward pass to finish first, and that wait gets counted towards "toCPU".

Best regards

Thomas

You're right! Thanks!

I'm new to libtorch; what does "synchronize between calls" mean?
Could you share the modified code? Thank you!

Just add cudaDeviceSynchronize() after forward().
(According to the NVIDIA docs, cudaDeviceSynchronize() blocks until the GPU has finished all preceding work.)
Note that this call does not make execution any faster; it only lets the CPU measure each step's elapsed time correctly.
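
For reference, here is a minimal sketch of the timing code with synchronization added. It assumes the same tmp, resNetParam, m_device, m_model, t1, t2, and time_used variables as in the original post, plus #include <cuda_runtime.h> for cudaDeviceSynchronize():

t1 = std::chrono::steady_clock::now();

at::Tensor tmpData2 = torch::from_blob(tmp, { resNetParam.batchSize, resNetParam.roiSize, resNetParam.roiSize, resNetParam.imgDepth }, torch::kFloat).to(m_device);
tmpData2 = tmpData2.permute({ 0, 3, 1, 2 });

cudaDeviceSynchronize(); // wait until the host-to-device copy has actually finished
t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000.0;
printf("putIntoCuda %.3f ms \n", time_used);

t1 = std::chrono::steady_clock::now();

torch::Tensor out = m_model->forward({ tmpData2 }).toTensor();

cudaDeviceSynchronize(); // wait for the forward kernels to finish before stopping the clock
t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000.0;
printf("forward %.3f ms \n", time_used);

std::tuple<torch::Tensor, torch::Tensor> result = out.sort(-1, true);
cudaDeviceSynchronize(); // sort also runs asynchronously, so sync before timing the copy

t1 = std::chrono::steady_clock::now();

torch::Tensor sortedScores = std::get<0>(result).to(torch::kCPU);
torch::Tensor sortedIdx = std::get<1>(result).toType(torch::kInt32).to(torch::kCPU);

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000.0;
printf("toCPU %.3f ms \n", time_used);

With this, the "toCPU" number should drop to the actual device-to-host copy time, and the wait for the GPU work shows up under "forward" instead.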

Is the forward time really that slow?

Generally, yes. It depends on your network and the size of the input tensor.

I ran into the same problem: the PyTorch code works fine, but libtorch is very slow, regardless of whether I add c10::cuda::CUDACachingAllocator::emptyCache(); or not.