Speed-up libtorch operators on CPU

Joe_Harrison · October 1, 2021, 8:43am

I’ve done a speed comparison between simple operators in libtorch and plain vectors. I don’t know whether the comparison is entirely fair, but the vector method seems to be about 60 times faster. Is there a way to improve the speed of simple operators? I want to know whether it would make sense to convert from tensor to vector and back during inference on the CPU.

#include <torch/torch.h>
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <vector>

using namespace std;
using torch::Tensor;

int main() {
    Tensor X = torch::randn({ 253 });
    Tensor output = X;

    // Time 10M element-wise multiplications on a libtorch tensor.
    auto time = chrono::system_clock::now();
    {
        torch::NoGradGuard no_grad;
        for (int i = 0; i < 10000000; i++) {
            output = X * X;
        }
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::system_clock::now() - time).count() / 1000. << endl;

    // Convert tensor to vector
    vector<float> v(X.data_ptr<float>(), X.data_ptr<float>() + X.numel());
    vector<float> results = v;

    // Time the same multiplication on a plain std::vector.
    time = chrono::system_clock::now();
    for (int i = 0; i < 10000000; i++) {
        transform(v.cbegin(), v.cend(), v.cbegin(), results.begin(), multiplies<float>());
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::system_clock::now() - time).count() / 1000. << endl;
}

Libtorch: 7.125 seconds
Standard vector: 0.282 seconds

ryanleary · October 6, 2021, 2:42pm

Which compiler are you using? Are you sure that loop isn’t being optimized out? I see the following on my system:

15.365
51.345

googlebot · October 6, 2021, 3:15pm

If you have loops with 100+ iterations in practice, then yes, you should be using data_ptr<T> for “unboxing”. A vector is not the best choice here; instead you can either work with pointers or use something like (mapped) Eigen arrays.
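
A minimal sketch of the pointer-based approach, not taken from the thread: square_elements and square_copy are hypothetical names, and the driver uses plain buffers so the example builds without libtorch. It assumes that, in real use, the pointers would come from data_ptr<float>() on contiguous CPU float tensors.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise square of n floats through raw pointers.
// With libtorch, `in` and `out` could come from X.data_ptr<float>() and
// output.data_ptr<float>() on contiguous CPU float tensors, with n = X.numel().
void square_elements(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}

// Stand-in driver that uses plain buffers instead of tensors.
std::vector<float> square_copy(const std::vector<float>& values) {
    std::vector<float> result(values.size());
    square_elements(values.data(), result.data(), values.size());
    return result;
}
```

Working on the pointers directly avoids copying the tensor data into a separate vector first, which is presumably why a vector is described as not the best choice.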

googlebot · October 6, 2021, 3:21pm

And I forgot about accessors. This is from the docs:

torch::Tensor foo = torch::rand({12, 12});

// assert foo is 2-dimensional and holds floats.
auto foo_a = foo.accessor<float,2>();
float trace = 0;

for(int i = 0; i < foo_a.size(0); i++) {
  // use the accessor foo_a to get tensor data.
  trace += foo_a[i][i];
}

Joe_Harrison · October 7, 2021, 5:52am

/usr/bin/c++ --version gives me
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Eugene7 · October 8, 2021, 8:06am

To help you train faster, here are 8 tips you should be aware of that … data from GPU to CPU and dramatically slows your performance. As you can see, the models using NNAPI run about 25-30% faster for both Float32 and Int8 compared with the CPU models.

Joe_Harrison · October 8, 2021, 8:35am

I’m not transferring any data from GPU to CPU; I’m just using the CPU. I’m working with many tiny networks (32 nodes max) in parallel.

Joe_Harrison · October 11, 2021, 6:57am

The actual problem isn’t a loop of matrix operations; I just used this to time the operation. I don’t want to use the vector implementation because then I’d have to write backpropagation myself. My networks are actually very small, consisting of just 4 basic vector operations (+, -, x, /). Everything is run on the CPU.

googlebot · October 11, 2021, 10:05am

But the big absolute overhead you’ve observed comes from repeated Tensor unwrapping in the loop. If one iteration of your neural network consists of 100-1000 operations, and tensor sizes are bigger (from batching), this overhead is not so significant.
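
As a rough check on the size of that overhead, the timings from the first post can be divided by the 10,000,000 iterations. This takes the posted numbers at face value and assumes the vector loop was not optimized away, which an earlier reply questions.

```latex
\begin{align*}
t_{\text{libtorch}} &= \frac{7.125\ \text{s}}{10^{7}} \approx 712\ \text{ns per call} \\
t_{\text{vector}}   &= \frac{0.282\ \text{s}}{10^{7}} \approx 28\ \text{ns per call} \\
t_{\text{overhead}} &\approx 712\ \text{ns} - 28\ \text{ns} = 684\ \text{ns per call}
\end{align*}
```

If that fixed per-call cost stays roughly constant, it is spread over more elements as the tensors grow, so its share of the total time shrinks with batching.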

For long sequences of pointwise operations, JIT compiling does a decent job of fusing them (it used to need the USE_CPU_FUSER=1 environment variable; I haven’t tried it recently).

dtsaras · April 14, 2023, 3:07am

Hey, did you end up finding out how to speed up this process?

Joe_Harrison · November 24, 2023, 8:39am

Yes, but not satisfyingly. I ended up writing my own backprop. Torch is great for larger models where the overhead makes sense.