How to speed up for loop in customized RNN?

Hi there, I am working on a new RNN unit implementation. Since the formulation is totally different from existing RNN units, I implemented everything from scratch. To process the information at each time step, I used a for loop over the time steps. It looks like the code below. Unfortunately, it is much slower than its Theano counterpart. I am wondering whether there is a special way to write the for loop in RNNs that is faster than the naive way. More generally, is there any special way to speed up a general for loop in PyTorch?

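import torch
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size+hidden_size, hidden_size)
        self.h2o = nn.Linear(input_size+hidden_size, output_size)
        self.tanh = nn.Tanh()

    def forward(self, X):
        # X has shape (time_steps, batch_size, input_size)
        time_steps = X.size(0)
        batch_size = X.size(1)
        hidden = Variable(torch.zeros(batch_size, self.hidden_size))
        outputs = []
        hiddens = []
        for t in range(time_steps):
            x_input = X[t]
            hidden_input = hidden
            # Concatenate the current input with the previous hidden state
            inp = torch.cat((x_input, hidden_input), 1)
            hidden = self.tanh(self.i2h(inp))
            output = self.h2o(inp)
            # Collect the per-step results so they can be returned together
            hiddens.append(hidden)
            outputs.append(output)
        return torch.cat(hiddens, 1), torch.cat(outputs, 1)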

I’m afraid there’s not a lot you can do at the moment. We know what issues are slowing down small RNN cells and will be fixing them once we finish the autograd refactor.


One thing I noticed is that if you implement an RNN with complex control flow it’s actually faster to run most of it on the CPU and only offload big matrix operations to the GPU. It’s not surprising perhaps, and it applies to all frameworks.

Not saying that this is the case here, just a sidenote.


Yes, that’s because of the kernel launch latency. Dispatching the nonlinearities is often slower than computing them.

@Veril Thanks for the note! I will give it a shot on my model.

@apaszke I see. Looking forward to the fix. :+1:

@apaszke I’m wondering whether there is any temporary trick to speed it up?

Removing the biases might help a bit. I now looked at the cell once again and I don’t think it should be slow. I think I’ll need to take a look at it in the profiler.

What input_size, hidden_size and output_size are you using?

The input is TxBxD. T is the time steps (usually equals 4096). B is the batch size and I usually set it to 2. D is the dimension of features (which is usually 256). The dimension of hidden states is 128 and output feature dimension is 128 too.
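
One trick that may help with a cell like the one at the top of the thread: a Linear layer applied to the concatenation of x_t and the hidden state is the same as the sum of one Linear on x_t and one on the hidden state, so the input part can be computed for all time steps in a single matrix multiplication before the loop. In that cell the output at step t depends only on x_t and the previous hidden state, so it can be computed after the loop too. Below is a minimal sketch under those assumptions. The class name HoistedRNN and the layer names x2h, h2h, x2o are made up for illustration, and the results are stacked along a new time dimension rather than concatenated along dim 1. Whether this is actually faster for a given model has to be measured.

```python
import torch
import torch.nn as nn

class HoistedRNN(nn.Module):
    """Hypothetical variant of the cat-based RNN with the input projections
    moved out of the time loop."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Linear(cat(x, h)) is split into an x part (with bias) and an h part.
        self.x2h = nn.Linear(input_size, hidden_size)
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.x2o = nn.Linear(input_size, output_size)
        self.h2o = nn.Linear(hidden_size, output_size, bias=False)

    def forward(self, X):
        # X: (time_steps, batch_size, input_size)
        time_steps, batch_size, _ = X.shape
        xh = self.x2h(X)  # all input projections in one matmul: (T, B, H)
        hidden = X.new_zeros(batch_size, self.hidden_size)
        prev_hiddens, hiddens = [], []
        for t in range(time_steps):
            prev_hiddens.append(hidden)
            # Only the hidden-to-hidden product remains inside the loop.
            hidden = torch.tanh(xh[t] + self.h2h(hidden))
            hiddens.append(hidden)
        # The output at step t uses x_t and the hidden state from step t-1,
        # so it can be computed for all steps at once after the loop.
        outputs = self.x2o(X) + self.h2o(torch.stack(prev_hiddens))
        return torch.stack(hiddens), outputs
```

With the sizes mentioned above this would be called as `HoistedRNN(256, 128, 128)(X)` with `X` of shape (4096, 2, 256). The loop itself still runs 4096 steps, so this only removes part of the per-step work.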

Following this thread because I’m also interested in doing modifications to the original LSTM and benchmarked with my results on reddit. For example:

  1. Layer normalization
  2. Multiplicative Integration
  3. Additive Residual Connections between Layers

Is there any way to maintain the speed of the cuDNN LSTM while adding these new features? I don’t know if autograd has been refactored yet.
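
For concreteness, here is a rough sketch of what the first item in the list could look like as a hand-written cell. It is only one of several possible places to put the normalization and is not claimed to match any particular paper's variant. The class name LayerNormLSTMCell and the attribute names are made up for illustration. It uses nn.LayerNorm, so it runs as ordinary PyTorch ops rather than through the cuDNN LSTM.

```python
import torch
import torch.nn as nn

class LayerNormLSTMCell(nn.Module):
    """Hypothetical LSTM cell with layer normalization on the gate
    pre-activations and on the cell state before the output nonlinearity."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.ln_gates = nn.LayerNorm(4 * hidden_size)
        self.ln_cell = nn.LayerNorm(hidden_size)

    def forward(self, x, state):
        h0, c0 = state
        # Normalize all four gate pre-activations together (an assumption;
        # normalizing each gate separately is another option).
        u = self.ln_gates(self.lin(torch.cat((x, h0), 1)))
        i, f, g, o = u.chunk(4, dim=1)
        c = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(self.ln_cell(c))
        return h, c
```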


cuDNN is very well optimized low-level CUDA C and/or assembly code that is specific to exactly the LSTM variant that they implement, and no framework can maintain that same speed while allowing flexibility to modify it in the ways you describe. But PyTorch will continue to work on optimization of use cases like this, and while right now the speed loss will probably be somewhere between 2x and 5x, it should get better over time.

Hello, I am also trying to implement an RNN from scratch, an LSTM to be specific. I am using two for loops, one over the sequence time steps and the other over the layers, and I am not sure how autograd will work with this. Can you confirm that it backprops through all time steps and layers? Also, instead of matrix multiplication in the LSTM I am using convolutions, and I cannot see past the conv function in loss.backward(), as it has no attribute --> previous_functions?. Please guide.
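
Autograd records every operation that is actually executed, including those inside Python loops, so backward() should reach all time steps and layers. One way to check this on a small example is to look at the gradient of the loss with respect to the input at each time step. The sketch below uses Linear layers as a stand-in for the convolutions, and the function name and sizes are made up for illustration.

```python
import torch
import torch.nn as nn

def grad_magnitude_per_step(T=5, B=2, D=4, L=3, seed=0):
    """Run a T-step, L-layer tanh RNN written with two Python loops and
    return the input-gradient magnitude per time step plus the layers."""
    torch.manual_seed(seed)
    layers = nn.ModuleList([nn.Linear(2 * D, D) for _ in range(L)])
    X = torch.randn(T, B, D, requires_grad=True)
    h = [torch.zeros(B, D) for _ in range(L)]
    for t in range(T):                          # loop over time steps
        inp = X[t]
        for l, layer in enumerate(layers):      # loop over layers
            h[l] = torch.tanh(layer(torch.cat((inp, h[l]), 1)))
            inp = h[l]
    # The loss only looks at the final state of the top layer.
    loss = h[-1].sum()
    loss.backward()
    # One number per time step: sum of |dloss/dX[t]|.
    return X.grad.abs().sum(dim=(1, 2)), layers

grads, layers = grad_magnitude_per_step()
print(grads)  # every entry should be non-zero if all steps are reached
```

If an entry came out as exactly zero, that would suggest the graph was cut somewhere, for example by detaching a state between steps.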

Thank you,

Any updates on improving speed for customized RNNs? I am also working with a non-traditional RNN, so I cannot use the predefined cuDNN cells. I converted some of my old lua-torch code to pytorch and it is 3 times slower.

To give an example, I implemented from scratch an LSTM cell (see below) both in pytorch and lua-torch (using nngraph and cunn) and I ran it forward and backward 1000 times with fake data. The computational times on a Titan X are:
pytorch: 4.6s
lua-torch: 1.4s

I really enjoy pytorch so I hope something can be done about it. Thanks!

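import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.hidden_size = hidden_size
        # One linear layer computes the pre-activations of all four gates
        self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state0):
        h0, c0 = state0
        x_and_h0 = torch.cat((x, h0), 1)
        u = self.lin(x_and_h0)
        # Slice the pre-activations into input, forget, candidate and output gates
        i = torch.sigmoid(u[:, 0*self.hidden_size : 1*self.hidden_size])
        f = torch.sigmoid(u[:, 1*self.hidden_size : 2*self.hidden_size])
        g = torch.tanh(   u[:, 2*self.hidden_size : 3*self.hidden_size])
        o = torch.sigmoid(u[:, 3*self.hidden_size : 4*self.hidden_size])
        c = f*c0 + i*g
        h = o*torch.tanh(c)
        return (h, c)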

I got the computational time with the following parameters:

I’m not sure whether it would be effective in this case, but Numba or Cython may speed up the loop.

No, unfortunately it is not a problem with the loop. I run the cell 1000 times and it takes 4.6 sec in total. Looking at the details of each run, each of them takes on average 0.0045 sec, so the loop itself accounts for very little. Thanks though!

Profiling the cell, I find that only 20% of the computational time is spent on the matrix multiplication. On the other hand, 70% is spent on tanh, sigmoid, add, mul, cat and slice (roughly equally divided between them). Does this come from the kernel launch latency mentioned earlier in the conversation by @apaszke? If so, are there any plans/hope to improve on this? Thanks.
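
For anyone who wants to get a similar per-operation breakdown, here is a minimal sketch using torch.autograd.profiler. It is not necessarily how the numbers above were obtained. The function names and default sizes are made up for illustration, and as written it runs on the CPU only, so it shows operator overhead rather than GPU kernel time.

```python
import torch
import torch.nn as nn

def lstm_step(lin, x, h0, c0):
    """One LSTM step written with plain tensor ops."""
    u = lin(torch.cat((x, h0), 1))
    i, f, g, o = u.chunk(4, dim=1)
    c = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)
    h = torch.sigmoid(o) * torch.tanh(c)
    return h, c

def profile_lstm_step(input_size=256, hidden_size=128, batch=2, steps=20):
    """Profile forward and backward of lstm_step and return a text table."""
    lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)
    x = torch.randn(batch, input_size)
    h0 = torch.zeros(batch, hidden_size)
    c0 = torch.zeros(batch, hidden_size)
    with torch.autograd.profiler.profile() as prof:
        for _ in range(steps):
            h, c = lstm_step(lin, x, h0, c0)
            (h.sum() + c.sum()).backward()
    # Aggregate events by operator name, most expensive first.
    return prof.key_averages().table(sort_by="self_cpu_time_total")

print(profile_lstm_step())
```

The table lists each operator with its call count and time, which makes it easy to see how much goes to the matrix multiplication versus the pointwise ops.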

The difference between pytorch and lua torch is probably due to autograd overhead (it is ~10us per operation in pytorch). There are plans to reduce autograd overhead, which should help bring the pytorch time closer to lua torch. As for speeding up pointwise operations, a fuser to do this is on the roadmap. In the meantime the best solution is probably to write your own custom kernels using cupy, like in the pyinn project. You’d also need to hand-code your backward pass in this case.
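
To illustrate what hand-coding the backward pass involves, here is a sketch that wraps the pointwise part of the LSTM cell in a single torch.autograd.Function. It uses plain PyTorch ops inside forward and backward, not cupy, so it does not fuse the GPU kernels. It only collapses the gate nonlinearities and state update into one autograd node, and a custom kernel would replace the bodies of the two methods. The class name LSTMPointwise is made up for illustration, and the gate order (i, f, g, o) follows the cell posted above.

```python
import torch

class LSTMPointwise(torch.autograd.Function):
    """Hypothetical single autograd node for the LSTM gate nonlinearities
    and state update, with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, u, c0):
        # u: (batch, 4 * hidden) pre-activations, c0: (batch, hidden)
        i, f, g, o = u.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c0 + i * g
        tanh_c = torch.tanh(c)
        h = o * tanh_c
        ctx.save_for_backward(i, f, g, o, c0, tanh_c)
        return h, c

    @staticmethod
    def backward(ctx, grad_h, grad_c):
        i, f, g, o, c0, tanh_c = ctx.saved_tensors
        # Total gradient reaching c: directly, and through h = o * tanh(c).
        dc = grad_c + grad_h * o * (1 - tanh_c * tanh_c)
        # Chain rule through each gate's nonlinearity.
        di = dc * g * i * (1 - i)
        df = dc * c0 * f * (1 - f)
        dg = dc * i * (1 - g * g)
        do = grad_h * tanh_c * o * (1 - o)
        du = torch.cat((di, df, dg, do), dim=1)
        dc0 = dc * f
        return du, dc0
```

In a cell it would be used as `h, c = LSTMPointwise.apply(self.lin(x_and_h0), c0)`. A hand-written backward like this is easy to get wrong, so it is worth checking with torch.autograd.gradcheck on double-precision inputs before relying on it.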

Thanks for the detailed answer! Things make much more sense.

Any updates at this moment?