Hi all, I’ll preface this by saying I am brand new to PyTorch and may be missing something obvious. Thanks in advance for your time.

I’ve been testing the training time of a simple model using basic PyTorch functionality (autograd, SGD). It is about 2000x slower than a comparable training run written in raw C++ (without PyTorch). I understand that PyTorch creates and manages dynamic computational graphs, so I would expect it to be slower than raw C++, maybe by a factor of 10 or so, but I am surprised the difference is this large. Maybe I am doing something silly in my PyTorch code that is hurting performance. I would like to verify that my results are correct, and to better understand why PyTorch takes such a performance hit on small models.

Additional notes

- I’ve also tried the C++ PyTorch bindings (LibTorch) and have gotten similar timings.
- I haven’t put any special effort into optimizing the C++ code.
- I spent some time trying to isolate the bottleneck. The `backward()` call takes up the majority of the time, but even if I remove it we’re still two orders of magnitude off the C++ code. In particular, it seems that any operation on a tensor carries a large fixed cost of around 3 microseconds, regardless of the size of the tensor. For example, `x = 3*t + 4*t + 5*t` takes around 10 microseconds if `t` is a tensor that holds only one element. (A timing sketch follows this list.)
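
For anyone who wants to reproduce the per-op measurement, here’s a minimal timing sketch (the loop count and the `time.time()`-based timing are just my choices; absolute numbers will vary from machine to machine):

```
import torch
import time

t = torch.ones(1)  # single-element tensor, no autograd

n = 100_000
start = time.time()
for _ in range(n):
    x = 3*t + 4*t + 5*t  # three scalar multiplies and two adds per iteration
end = time.time()

# Total time divided by iteration count, in microseconds
print((end - start) / n * 1e6, 'microseconds per expression')
```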

Here’s the PyTorch script. It runs 100,000 epochs of a linear model with 11 data points per epoch, and takes 18 seconds.

```
import torch
import time

# Training data. t_u are inputs, t_c are target outputs
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = torch.tensor(t_u)
t_c = torch.tensor(t_c)

def training_loop(n_epochs, learning_rate, w, b, t_u, t_c):
    for epoch in range(n_epochs):
        # Zero gradients from the previous epoch (grads are None on the first pass)
        if w.grad is not None:
            w.grad.zero_()
        if b.grad is not None:
            b.grad.zero_()
        t_p = w*t_u + b
        loss = ((t_p - t_c)**2).mean()
        loss.backward()
        # Gradient step; detach so each epoch starts from fresh leaf tensors
        w = (w - learning_rate * w.grad).detach().requires_grad_()
        b = (b - learning_rate * b.grad).detach().requires_grad_()
    return w, b

w = torch.ones(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

start = time.time()
# 100,000 epochs
w, b = training_loop(100000, 0.0001, w, b, t_u, t_c)
end = time.time()

print(w, b)
print(end - start, 'seconds')
```
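
To make concrete what I mean by getting rid of the `backward()` call: below is a sketch of the same loop with the gradients written out by hand inside `torch.no_grad()` (illustrative, not my exact test; it reuses `t_u` and `t_c` from the script above). It still issues a dozen or so small tensor ops per epoch, so it still pays the fixed per-op cost:

```
def training_loop_manual(n_epochs, learning_rate, w, b, t_u, t_c):
    with torch.no_grad():
        for epoch in range(n_epochs):
            t_p = w*t_u + b                   # forward pass
            diff = t_p - t_c
            w_grad = (2 * t_u * diff).mean()  # same gradients the C++ version computes
            b_grad = (2 * diff).mean()
            w -= learning_rate * w_grad       # in-place updates, no graph bookkeeping
            b -= learning_rate * b_grad
    return w, b

w, b = training_loop_manual(100000, 0.0001, torch.ones(1), torch.zeros(1), t_u, t_c)
print(w, b)
```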

Here’s the raw C++ code, which runs 100 *million* epochs in 9 seconds. So 1000x the amount of computation in half the time:

```
#include <iostream>
#include <vector>

int main() {
    double LEARNING_RATE = 0.0001;
    auto t_c = std::vector<double>({0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0});
    auto t_u = std::vector<double>({35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4});
    double w = 1;
    double b = 0;
    int N = t_c.size();
    // 100 million epochs
    for (auto i = 0; i < 100000000; ++i) {
        // Compute predictions
        auto t_p = std::vector<double>(N);
        for (int k = 0; k < N; ++k) {
            t_p[k] = w*t_u[k] + b;
        }
        // Compute loss (mean squared error, to mirror the PyTorch script)
        double loss = 0;
        for (int k = 0; k < N; ++k) {
            loss += (t_c[k] - t_p[k]) * (t_c[k] - t_p[k]);
        }
        loss = loss / N;
        // Compute gradients analytically
        double w_grad = 0;
        double b_grad = 0;
        for (int k = 0; k < N; ++k) {
            w_grad += 2 * t_u[k] * (t_p[k] - t_c[k]);
            b_grad += 2 * (t_p[k] - t_c[k]);
        }
        w_grad = w_grad / N;
        b_grad = b_grad / N;
        // Gradient step
        w -= LEARNING_RATE * w_grad;
        b -= LEARNING_RATE * b_grad;
    }
    std::cout << w << " " << b << "\n";
}
```
```