I’ve done a speed comparison between simple operators in libtorch and plain std::vector operations. I’m not entirely sure the comparison is fair, but the vector version comes out roughly 25 times faster (see the timings below). Is there a way to improve the speed of simple operators? I want to know whether it would make sense to convert between tensors and vectors during inference on the CPU.
#include <torch/torch.h>
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <vector>

using namespace std;

int main() {
    // Time 10 million elementwise multiplies through libtorch.
    torch::Tensor X = torch::randn({ 253 });
    torch::Tensor output = X;
    chrono::time_point<chrono::system_clock> time = chrono::system_clock::now();
    {
        torch::NoGradGuard no_grad;
        for (int i = 0; i < 10000000; i++) {
            output = X * X;
        }
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::system_clock::now() - time).count() / 1000. << endl;

    // Convert the tensor to a std::vector and time the same elementwise multiply there.
    std::vector<float> v(X.data_ptr<float>(), X.data_ptr<float>() + X.numel());
    vector<float> results = v;
    time = chrono::system_clock::now();
    for (int i = 0; i < 10000000; i++) {
        std::transform(v.cbegin(), v.cend(), v.cbegin(), results.begin(), std::multiplies<float>());
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::system_clock::now() - time).count() / 1000. << endl;

    return 0;
}
Libtorch: 7.125 seconds
Standard vector: 0.282 seconds
If you have loops with 100+ iterations in practice, then yes, you should be using data_ptr<T> for “unboxing”. A std::vector is not the best choice here, though; instead you can either work with raw pointers or use something like mapped Eigen arrays (a sketch of that follows the accessor example below). For reading CPU tensor data directly, the accessor API also works:
torch::Tensor foo = torch::rand({12, 12});

// assert foo is 2-dimensional and holds floats.
auto foo_a = foo.accessor<float, 2>();
float trace = 0;

for (int i = 0; i < foo_a.size(0); i++) {
    // use the accessor foo_a to get tensor data.
    trace += foo_a[i][i];
}
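For the Eigen route, here is a minimal sketch, assuming Eigen’s headers are on your include path and that the tensors are contiguous float CPU tensors. It maps the tensor buffers into Eigen arrays via data_ptr<float>() and does the elementwise work there, while keeping torch::Tensor at the boundaries:

#include <torch/torch.h>
#include <Eigen/Core>

// Sketch: map the tensor storage into Eigen arrays and multiply elementwise
// without going through a libtorch operator call per iteration.
torch::Tensor X = torch::randn({ 253 });
torch::Tensor output = torch::empty_like(X);

Eigen::Map<const Eigen::ArrayXf> x_map(X.data_ptr<float>(), X.numel());
Eigen::Map<Eigen::ArrayXf> out_map(output.data_ptr<float>(), output.numel());

out_map = x_map * x_map;  // elementwise multiply, written straight into output's storage

Because out_map writes directly into output’s storage, no copy back into a tensor is needed afterwards.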
The actual problem isn’t a loop of matrix operations; I just used that to time the operation. I don’t want to use the vector implementation because then I’d have to write backpropagation myself. My networks are actually very small, consisting of just four basic elementwise vector operations (+, -, *, /). Everything runs on the CPU.
But the big absolute overhead you’ve observed comes from the repeated Tensor unwrapping done on every operator call in the loop. If each iteration of your neural network consists of 100-1000 operations, and the tensor sizes are bigger (from batching), this overhead is not so significant.
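To make the amortization point concrete, here is a small sketch (the batch size of 10000 is just an assumption for illustration): one batched multiply pays the per-call overhead once instead of once per sample.

torch::NoGradGuard no_grad;
// One multiply over a {10000, 253} batch instead of 10000 multiplies over {253} tensors.
torch::Tensor Xb = torch::randn({ 10000, 253 });
torch::Tensor out = Xb * Xb;  // single operator call for the whole batch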
For long sequences of pointwise operations, the JIT compiler does a decent job of fusing them (it used to need the USE_CPU_FUSER=1 environment variable; I haven’t tried it recently).
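As a rough sketch of that route (not the exact setup above): script the small network in Python, save it, and run it through the JIT from C++. Here "net.pt" is a hypothetical filename, and whether the fuser actually fuses the pointwise ops depends on your build.

#include <torch/script.h>

// Hypothetical: "net.pt" was produced in Python with torch.jit.script(model).save("net.pt").
torch::jit::script::Module net = torch::jit::load("net.pt");
net.eval();

torch::NoGradGuard no_grad;
torch::Tensor x = torch::randn({ 253 });
torch::Tensor out = net.forward({ x }).toTensor();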