How to quickly transfer data from CUDA to CPU with libtorch?

#1

I use a ResNet to classify images, and I find that copying the result back to the CPU is slower than forward(), and far slower than copying the image data to CUDA.
Does anyone know how to speed up transferring data back to the CPU?

Environment:
Windows 7
CUDA 9.2
libtorch 1.0 (nightly release)
Visual Studio 2019


The relevant code is as follows:

t1 = std::chrono::steady_clock::now();

at::Tensor tmpData2 = torch::from_blob(tmp, { resNetParam.batchSize, resNetParam.roiSize, resNetParam.roiSize, resNetParam.imgDepth }, torch::kFloat).to(m_device);
tmpData2 = tmpData2.permute({ 0, 3, 1, 2 });

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration<double, std::milli>(t2 - t1).count();
printf("putIntoCuda %.3f ms \n", time_used);

t1 = std::chrono::steady_clock::now();

torch::Tensor out = m_model->forward({ tmpData2 }).toTensor();

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration<double, std::milli>(t2 - t1).count();
printf("forward %.3f ms \n", time_used);

std::tuple<torch::Tensor, torch::Tensor> result = out.sort(-1, true);

t1 = std::chrono::steady_clock::now();

torch::Tensor sortedScores = std::get<0>(result).to(torch::kCPU);
torch::Tensor sortedIdx = std::get<1>(result).toType(torch::kInt32).to(torch::kCPU);

t2 = std::chrono::steady_clock::now();
time_used = std::chrono::duration<double, std::milli>(t2 - t1).count();
printf("toCPU %.3f ms \n", time_used);

(Thomas V) #2

These measurements are quite likely off because you don't seem to synchronize between calls. CUDA kernel launches are asynchronous, so the part measured as "toCPU" likely includes work queued earlier, such as the forward pass itself.

Best regards

Thomas

#3

You're right! Thanks.

(Yao Zihang) #5

I'm new to libtorch. What does "synchronize between calls" mean?
Could you share the modified code? Thank you!

#6

Just add cudaDeviceSynchronize() after forward().
(According to the NVIDIA docs, cudaDeviceSynchronize() blocks until the GPU has finished all previously issued work.)
This call doesn't change the actual execution speed; it just lets the CPU measure the elapsed time correctly.