Is LSTM in pytorch ~3 times slower compared to tensorflow or am I doing it wrong?

Hi, I’m playing a bit with pytorch and noticed that my pytorch code is roughly 3.5 times slower than equivalent tensorflow code.

Running on CPU: pytorch 0.3.0 time: 265s, tensorflow time: 77s.

UPD1: If I use pytorch 0.4.0 compiled from source, I get 218s, which is still ~2.8 times slower.

UPD2: I got very helpful feedback on reddit. The reason is that my LSTM is unusually small, and pytorch was never optimized for such a non-typical case. Indeed, increasing the number of inputs to 300 makes the performance quite comparable (pytorch is only ~20% slower).

Note that I use a generic tensorflow build which, unlike the pytorch I compiled from source, does not use some of my CPU instructions (SSE4.1, SSE4.2, AVX).

The code looks equivalent; 32-bit floats are used in both cases.

Here is how I’m doing it with pytorch:

#!/usr/bin/env python3
import torch
import torch.nn as nn
import torch.autograd as autograd


n_iter = 1000

n_layers = 2

batch_size = 32
seq_len = 1000
input_dim = 7
# random input batch (torch.rand gives float32), batch_first layout
x = autograd.Variable(torch.rand(batch_size, seq_len, input_dim))

# 2-layer LSTM with hidden size equal to the input size (7)
lstm = nn.LSTM(input_dim, input_dim, n_layers, batch_first=True)

# forward pass only, repeated n_iter times
for _ in range(n_iter):
    lstm(x)
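
For reference, the larger variant mentioned in UPD2 only changes the dimension; a minimal sketch, assuming “inputs” means both the input and hidden size of the LSTM:

input_dim = 300  # larger, more typical LSTM size; the only change vs. the benchmark above
x = autograd.Variable(torch.rand(batch_size, seq_len, input_dim))
lstm = nn.LSTM(input_dim, input_dim, n_layers, batch_first=True)

for _ in range(n_iter):
    lstm(x)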

And here is the tensorflow code:

#!/usr/bin/env python3
import numpy as np
import tensorflow as tf


# limit tensorflow to a single thread (intra- and inter-op)
conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)

n_iter = 1000

n_layers = 2

batch_size = 32
seq_len = 1000
input_dim = 7
# float64 numpy data; it is cast to float32 when fed into the float32 placeholder
data = np.random.uniform(size=(batch_size, seq_len, input_dim))

x = tf.placeholder(tf.float32, shape=(batch_size, seq_len, input_dim))

# 2-layer LSTM with hidden size equal to the input size (7)
cells = [tf.contrib.rnn.LSTMCell(input_dim) for _ in range(n_layers)]
multicell = tf.contrib.rnn.MultiRNNCell(cells)
rnn_outputs, final_state = tf.nn.dynamic_rnn(multicell, x, dtype=tf.float32)

init = tf.global_variables_initializer()

with tf.Session(config=conf) as sess:
    sess.run(init)
    # forward pass only, repeated n_iter times
    for _ in range(n_iter):
        sess.run(rnn_outputs, {x: data})

Versions: tensorflow 1.4.1 (generic build, not recompiled), pytorch 0.3.0.post4.

I would try compiling pytorch 0.4 from source; in my experiments it is significantly faster. There is also a CPU-optimized library called NNPACK that pytorch can use; if pytorch is compiled with NNPACK support, it should run faster. Finally, python 2 is known to be faster than python 3.

Also, there is the pytorch JIT compiler, which is still experimental in pytorch 0.4, but it works for me: it traces the python code and compiles it into an optimized graph on the fly, with very few changes to your source code. It should also run much faster.
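
Roughly, tracing looks like this; a minimal sketch against torch.jit.trace as it exists in current builds (the exact syntax of the experimental 0.4 JIT may differ):

import torch
import torch.nn as nn

lstm = nn.LSTM(7, 7, 2, batch_first=True)
example = torch.rand(32, 1000, 7)

# trace the module once with an example input; subsequent calls reuse the
# recorded graph instead of re-executing the python code
traced_lstm = torch.jit.trace(lstm, example)
output, state = traced_lstm(example)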

Hi, thanks for your suggestions.

Recompiling from source improved the speed a bit; however, it’s still ~2.8 times slower (note that tensorflow was not compiled from source and does not use some CPU flags).

Regarding NNPACK, there is no documentation on how to compile pytorch with NNPACK support.

So far I have just tried setting the NNPACK_ROOT_DIR environment variable, but it does not work. I saw your question; did you manage to compile it?

Regarding the JIT compiler, I do not think it will help, since the python code is not the bottleneck here.

I think they have mostly optimized the framework for CUDA GPUs so far. Even though I have access to many fast GPUs, I sometimes like to prototype/debug things on my laptop without a GPU, so I also needed faster CPU compute. I did find a way to install NNPACK properly; I just followed the steps here. However, the current code still has unresolved build issues with it; I submitted one today on github.

It now builds with NNPACK for me.