Hi, I’m playing a bit with PyTorch and noticed that my PyTorch code is about four times slower than the equivalent TensorFlow code.
Running on CPU: PyTorch 0.3.0 time: 265s, TensorFlow time: 77s.
UPD1: With PyTorch 0.4.0 compiled from source I get 218s, which is still ~2.8 times slower.
UPD2: I got very helpful feedback on reddit. The reason is that my LSTM is unusually small, and PyTorch was never optimized for such an atypical case. Indeed, increasing the number of inputs to 300 makes the performance quite comparable (PyTorch is only ~20% slower).
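For reference, here is that wider variant as a standalone sketch (same structure as the full PyTorch script below, with only the layer width changed):

#!/usr/bin/env python3
# Same benchmark as the PyTorch script below, but with a 300-unit LSTM;
# at this width PyTorch comes within ~20% of TensorFlow on my machine.
import torch
import torch.nn as nn
import torch.autograd as autograd

x = autograd.Variable(torch.rand(32, 1000, 300))  # batch, seq_len, inputs
lstm = nn.LSTM(300, 300, 2, batch_first=True)     # 2 layers, 300 hidden units
for _ in range(1000):
    lstm(x)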
Note that I use the generic TensorFlow build, which, unlike the compiled PyTorch, does not support some of my CPU instructions (SSE4.1, SSE4.2, AVX).
The two snippets look equivalent; 32-bit floats are used in both cases.
Here is how I’m doing it with PyTorch:
#!/usr/bin/env python3
import torch
import torch.nn as nn
import torch.autograd as autograd

n_iter = 1000
n_layers = 2
batch_size = 32
seq_len = 1000
input_dim = 7

# random input batch and a 2-layer LSTM with 7 inputs / 7 hidden units
x = autograd.Variable(torch.rand(batch_size, seq_len, input_dim))
lstm = nn.LSTM(input_dim, input_dim, n_layers, batch_first=True)

# benchmark: forward pass only
for _ in range(n_iter):
    lstm(x)
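One caveat about the loop above: it also builds an autograd graph on every call. Since only the forward pass is benchmarked, in PyTorch 0.3 the input can be marked volatile to skip that bookkeeping (a sketch; the 265s figure above was measured without it):

# Inference-only variant; reuses lstm and the sizes from the script above.
# volatile=True disables graph construction in PyTorch 0.3
# (replaced by torch.no_grad() in later releases).
x_inf = autograd.Variable(torch.rand(batch_size, seq_len, input_dim),
                          volatile=True)
for _ in range(n_iter):
    lstm(x_inf)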
And here is the TensorFlow code:
#!/usr/bin/env python3
import numpy as np
import tensorflow as tf

# use 1 CPU
conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)

n_iter = 1000
n_layers = 2
batch_size = 32
seq_len = 1000
input_dim = 7

data = np.random.uniform(size=(batch_size, seq_len, input_dim))
x = tf.placeholder(tf.float32, shape=(batch_size, seq_len, input_dim))

# 2-layer LSTM with 7 inputs / 7 hidden units, unrolled dynamically
cells = [tf.contrib.rnn.LSTMCell(input_dim) for _ in range(n_layers)]
multicell = tf.contrib.rnn.MultiRNNCell(cells)
rnn_outputs, final_state = tf.nn.dynamic_rnn(multicell, x, dtype=tf.float32)

init = tf.global_variables_initializer()
with tf.Session(config=conf) as sess:
    sess.run(init)
    # benchmark: forward pass only
    for _ in range(n_iter):
        sess.run(rnn_outputs, {x: data})
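One more detail: the TensorFlow script above is pinned to a single thread, while the PyTorch one is not. To rule out threading as a factor, the PyTorch side can be pinned the same way (a sketch, not included in the timings above):

# Add at the top of the PyTorch script to match
# intra_op_parallelism_threads=1 in the TensorFlow config.
import torch
torch.set_num_threads(1)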
TensorFlow v1.4.1 (the generic, not recompiled, version), PyTorch 0.3.0.post4.