Speed-up libtorch operators on CPU

I’ve compared the speed of simple operators in libtorch against the same operation on a plain std::vector. I’m not sure the comparison is entirely fair, but in the timings below the vector version is roughly 25 times faster (7.125 s versus 0.282 s). Is there a way to speed up simple operators? I’d like to know whether it makes sense to convert between tensor and vector during inference on the CPU.

#include <torch/torch.h>
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <vector>

using namespace std;
using torch::Tensor;

int main() {
    Tensor X = torch::randn({ 253 });
    Tensor output = X;

    std::chrono::time_point<std::chrono::system_clock> time = chrono::system_clock::now();
    {
        torch::NoGradGuard no_grad;  // no autograd bookkeeping while timing
        for (int i = 0; i < 10000000; i++) {
            output = X * X;
        }
    }
    cout<<chrono::duration_cast<chrono::milliseconds>(chrono::system_clock::now() - time).count()/1000.<<endl;

    //Convert tensor to vector
    std::vector<float> v(X.data_ptr<float>(), X.data_ptr<float>() + X.numel());
    vector<float> results = v;

    time = chrono::system_clock::now();
    for(int i=0;i<10000000;i++){
        std::transform(v.cbegin(), v.cend(), v.cbegin(), results.begin(), std::multiplies<float>());
    }
    cout<<chrono::duration_cast<chrono::milliseconds>(chrono::system_clock::now() - time).count()/1000.<<endl;
    return 0;
}

Libtorch: 7.125 seconds
Standard vector: 0.282 seconds

Which compiler are you using? Are you sure that loop isn’t being optimized out? I see the following on my system:


If you have loops with 100+ iterations in practice, then yes, you should be using data_ptr<T> for “unboxing”. A vector is not the best choice here, because constructing it copies the data. Instead, you can either work with the pointers directly or use something like (mapped) Eigen arrays.
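To make the pointer suggestion concrete, here is a minimal sketch of an element-wise square over a raw float buffer. The helper name `square_elements` is made up for illustration. It deliberately avoids libtorch so it compiles on its own; the assumption is that, with libtorch, the pointers would come from `data_ptr<float>()` on contiguous CPU float tensors.

```cpp
#include <cassert>
#include <cstddef>

// Element-wise square over a raw float buffer (hypothetical helper).
// Assumption: with libtorch, `in` would be X.data_ptr<float>() of a
// contiguous CPU float tensor and `n` would be X.numel().
inline void square_elements(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];  // no Tensor dispatch inside the hot loop
    }
}
```

With libtorch this would presumably be called as `square_elements(X.data_ptr<float>(), output.data_ptr<float>(), X.numel())` after allocating `output` with `torch::empty_like(X)`. Note that writes made through a raw pointer are not tracked by autograd, so this only fits inference-style code.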

And I forgot about accessors. This is from the docs:

torch::Tensor foo = torch::rand({12, 12});

// assert foo is 2-dimensional and holds floats.
auto foo_a = foo.accessor<float,2>();
float trace = 0;

for(int i = 0; i < foo_a.size(0); i++) {
  // use the accessor foo_a to get tensor data.
  trace += foo_a[i][i];
}

/usr/bin/c++ --version gives me
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
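On the earlier question of whether the loop might be optimized out: one common guard is to consume the results, for example by accumulating a checksum and printing it after the timing. The sketch below is only an illustration; `BenchResult` and `time_vector_square` are hypothetical names, and it uses `steady_clock` rather than the `system_clock` from the original post. It keeps the compiler from discarding the work entirely, but a compiler may still hoist loop-invariant work, so looking at the generated assembly remains the more reliable check.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical result type: elapsed time plus a value derived from the output.
struct BenchResult {
    double seconds;
    float checksum;
};

// Repeats the element-wise multiply `iterations` times and folds one output
// element per iteration into a checksum, so the results are observably used.
inline BenchResult time_vector_square(const std::vector<float>& v, int iterations) {
    BenchResult r{0.0, 0.0f};
    if (v.empty() || iterations <= 0) {
        return r;
    }
    std::vector<float> results(v.size());
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::transform(v.cbegin(), v.cend(), v.cbegin(), results.begin(),
                       std::multiplies<float>());
        r.checksum += results[static_cast<std::size_t>(i) % results.size()];
    }
    const auto stop = std::chrono::steady_clock::now();
    r.seconds = std::chrono::duration<double>(stop - start).count();
    return r;
}
```

Printing both `seconds` and `checksum` after the call makes the result part of the program's observable output.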

To help you train faster, here are 8 tips you should be aware of that … data from GPU to CPU and dramatically slows your performance. As you can see, the models using NNAPI run about 25-30% faster for both Float32 and Int8 compared with the CPU models.

I’m not transferring any data from GPU to CPU; I’m just using the CPU. I’m working with many tiny networks (32 nodes max) in parallel.

The actual problem isn’t a loop of matrix operations; I only used the loop to time the operation. I don’t want to use the vector implementation because then I’d have to write backpropagation myself. My networks are actually very small, consisting of just 4 basic vector operations (+,-,x,/). Everything is run on the CPU.

But the big absolute overhead you’ve observed comes from repeatedly unwrapping the Tensor inside the loop. If one iteration of your neural network consists of 100-1000 operations, and tensor sizes are bigger (from batching), this overhead is not so significant.
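As a rough illustration of the batching point, the sketch below packs several tiny inputs into one contiguous buffer and processes them in a single pass, so any fixed per-call cost is paid once rather than once per network. `pack_batch` and `square_all` are hypothetical helpers written without libtorch so the example stands alone. The libtorch analogue would presumably be to combine the inputs with `torch::stack` and evaluate `X * X` once on the stacked tensor.

```cpp
#include <cstddef>
#include <vector>

// Copies several small inputs back to back into one contiguous buffer.
inline std::vector<float> pack_batch(const std::vector<std::vector<float>>& inputs) {
    std::vector<float> packed;
    for (const auto& x : inputs) {
        packed.insert(packed.end(), x.begin(), x.end());
    }
    return packed;
}

// Squares every element of the packed buffer in one pass (one "call" for
// all networks instead of one call per network).
inline std::vector<float> square_all(const std::vector<float>& packed) {
    std::vector<float> out(packed.size());
    for (std::size_t i = 0; i < packed.size(); ++i) {
        out[i] = packed[i] * packed[i];
    }
    return out;
}
```

The results for network k then occupy the k-th slice of the output, assuming every input has the same length.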

For long sequences of pointwise operations, JIT compiling does a decent job of fusing them (it used to need the USE_CPU_FUSER=1 environment variable; I haven’t tried it recently).