Performance on small model 2000x slower than raw C++ (on CPU)

Hi all, I’ll preface this by saying I am brand new to PyTorch and may be missing something obvious. Thanks in advance for your time.

I’ve been testing the training time of a simple model that uses basic PyTorch functionality (autograd, SGD). It is about 2000x slower than a comparable training run written in raw C++ (without PyTorch). I understand that PyTorch creates and manages dynamic computational graphs, so I would expect it to be slower than raw C++, maybe by a factor of 10 or so, but I am surprised that the difference is this big. Maybe I am doing something silly in my PyTorch code that is hurting performance. I would like to check whether my results are correct, and to better understand why PyTorch takes this performance hit on small models.

Additional notes

  • I’ve also tried using the C++ PyTorch bindings (libtorch) and have gotten similar speed results (a rough sketch of what that looks like is included below).
  • I haven’t put any special effort into optimizing the C++ code.
  • I spent some time trying to isolate the bottleneck. The backward() call takes up the majority of the time; however, even if I remove it, we’re still two orders of magnitude off the C++ code. In particular, it seems that every operation on a tensor carries a large fixed cost of around 3 microseconds, regardless of the size of the tensor. For example, x = 3*t + 4*t + 5*t takes around 10 microseconds if t is a tensor that holds only 1 element (a timing sketch follows this list).
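
Here is a minimal sketch of the kind of micro-benchmark behind that per-op estimate (iteration counts are arbitrary):

import time
import torch

t = torch.ones(1)  # a tensor holding a single element

n = 100000
# Warm-up so one-time costs don't skew the measurement
for _ in range(1000):
    x = 3*t + 4*t + 5*t

start = time.perf_counter()
for _ in range(n):
    # several small tensor ops (scalar multiplies and adds) on a 1-element tensor
    x = 3*t + 4*t + 5*t
end = time.perf_counter()

print((end - start) / n * 1e6, 'microseconds per expression')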

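For completeness, here is roughly what the libtorch (C++ frontend) version mentioned above looks like. This is a sketch from memory rather than my exact test code, so details may differ:

#include <torch/torch.h>
#include <iostream>

int main() {
  // Same training data as the Python script below
  auto t_u = torch::tensor({35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4});
  auto t_c = torch::tensor({0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0});

  auto w = torch::ones({1}, torch::requires_grad());
  auto b = torch::zeros({1}, torch::requires_grad());

  double learning_rate = 0.0001;
  for (int epoch = 0; epoch < 100000; ++epoch) {
    auto t_p = w * t_u + b;
    auto loss = (t_p - t_c).pow(2).mean();
    loss.backward();

    // Update the parameters in place without recording the update in autograd
    torch::NoGradGuard no_grad;
    w -= learning_rate * w.grad();
    b -= learning_rate * b.grad();
    w.grad().zero_();
    b.grad().zero_();
  }

  std::cout << w << " " << b << "\n";
}
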
Here’s the PyTorch script. It runs 100,000 epochs of a linear model with 11 datapoints in each epoch, and takes 18 seconds.

import torch
import time

# Training data. t_u are inputs, t_c are target outputs
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = torch.tensor(t_u)
t_c = torch.tensor(t_c)

def training_loop(n_epochs, learning_rate, w, b, t_u, t_c):
    for epoch in range(n_epochs):
        # Zero any gradients left over from the previous iteration
        if w.grad is not None:
            w.grad.zero_()
        if b.grad is not None:
            b.grad.zero_()

        # Forward pass and loss
        t_p = w*t_u + b
        loss = ((t_p - t_c)**2).mean()
        loss.backward()

        # SGD update: detach from the graph, then re-enable gradient tracking
        w = (w - learning_rate * w.grad).detach().requires_grad_()
        b = (b - learning_rate * b.grad).detach().requires_grad_()

    return w, b

w = torch.ones(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

start = time.time()
# 100,000 epochs
w, b = training_loop(100000, 0.0001, w, b, t_u, t_c)
end = time.time()

print(w, b)
print(end-start, 'seconds')

Here’s the raw C++ code, which runs 100 million epochs in 9 seconds. So 1000x the amount of computation in half the time:

#include <iostream>
#include <vector>
#include <array>

int main() {
  double LEARNING_RATE = 0.0001;

  auto t_c = std::vector<double>({0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0});
  auto t_u = std::vector<double>({35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4});

  double w = 1;
  double b = 0;

  int N = t_c.size();

  // 100 million epochs
  for (auto i=0; i<100000000; ++i) {
    // Compute predictions
    auto t_p = std::vector<double>(N);
    for (int k=0; k<N; ++k) {
      t_p[k] = w*t_u[k] + b;
    }

    // Compute loss
    double loss = 0;
    for (int k=0; k<N; ++k) {
      loss += (t_c[k] - t_p[k]) * (t_c[k] - t_p[k]);
    }
    loss = loss / N;

    // Compute gradient
    double w_grad = 0;
    double b_grad = 0;
    for (int k=0; k<N; ++k) {
      w_grad += 2 * t_u[k] * (t_p[k] - t_c[k]);
      b_grad += 2 * (t_p[k] - t_c[k]);
    }
    w_grad = w_grad / N;
    b_grad = b_grad / N;

    w -= LEARNING_RATE * w_grad;
    b -= LEARNING_RATE * b_grad;
  }

  std::cout << w << " " << b << "\n";
}

Nice example, I hope you enjoyed the trip to some obscure location. :wink:

So what happens is that Python and PyTorch do all sorts of (dynamic) bookkeeping in between the actual computation (op dispatch, memory allocation, recording the autograd graph). Given that the computation itself is trivial here, this overhead is extremely large in relation. When you get to more realistic problem sizes, the ratio is quite different and the overhead becomes much smaller relative to the actual work. At that point other factors determine the performance.
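
As a rough illustration of that point (sizes and iteration counts here are arbitrary), the same kind of forward pass on tensors 1000x larger is typically only modestly slower per iteration on CPU, because most of the cost is fixed per-op overhead rather than arithmetic:

import time
import torch

def time_forward(n_elems, iters=10000):
    # Same kind of forward pass as in the script above, on tensors of different sizes
    t_u = torch.rand(n_elems)
    t_c = torch.rand(n_elems)
    w = torch.ones(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    start = time.perf_counter()
    for _ in range(iters):
        loss = ((w * t_u + b - t_c) ** 2).mean()
    return (time.perf_counter() - start) / iters

print(time_forward(11))     # tiny tensors: almost all framework overhead
print(time_forward(11000))  # 1000x the data: usually only a little slower per iteration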

Best regards

Thomas
