How to speed up for loop in customized RNN?

Hi there, I am working on a new RNN unit implementation. Since its formulation is totally different from existing RNN units, I implemented everything from scratch. To process information at each time step, I use a for loop over the time steps, as in the code below. Unfortunately, it is much slower than its Theano counterpart. I am wondering whether there is a special way to write the for loop in RNNs that is faster than the naive approach. More generally, is there any way to speed up for loops in PyTorch?

import torch
import torch.nn as nn
from torch.autograd import Variable


class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        # Both layers see the concatenation of the current input and the previous hidden state.
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.h2o = nn.Linear(input_size + hidden_size, output_size)
        self.tanh = nn.Tanh()

    def forward(self, X):
        # X has shape (time_steps, batch_size, input_size)
        time_steps = X.size(0)
        batch_size = X.size(1)
        hidden = Variable(torch.zeros(batch_size, self.hidden_size))
        outputs = []
        hiddens = []
        for t in range(time_steps):
            x_input = X[t]
            hidden_input = hidden
            inp = torch.cat((x_input, hidden_input), 1)
            hidden = self.tanh(self.i2h(inp))
            output = self.h2o(inp)
            outputs.append(output)
            hiddens.append(hidden)  # collect the per-step hidden states
        return torch.cat(hiddens, 1), torch.cat(outputs, 1)

I’m afraid there’s not a lot you can do at the moment. We know what issues are slowing down small RNN cells and will be fixing them once we finish the autograd refactor.


One thing I noticed is that if you implement an RNN with complex control flow, it’s actually faster to run most of it on the CPU and only offload the big matrix operations to the GPU. It’s perhaps not surprising, and it applies to all frameworks.

Not saying that this is the case here, just a sidenote.
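As a rough sketch of that pattern (the sizes and the threshold below are made up for illustration; whether the device transfers actually pay off depends entirely on the model):

import torch

W = torch.randn(4096, 4096).cuda()        # large weight matrix stays on the GPU

def step(x_cpu):
    # cheap, branchy control flow runs on the CPU
    if x_cpu.abs().mean() < 1e-3:
        return x_cpu
    # only the big matrix multiplication is offloaded to the GPU
    y = torch.mm(x_cpu.cuda(), W).cpu()
    return torch.tanh(y)

x = torch.randn(64, 4096)                 # CPU tensor
out = step(x)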


Yes, that’s because of the kernel launch latency. Dispatching the nonlinearities is often slower than computing them.


@Veril Thanks for the note! I will give it a shot on my model.

@apaszke I see. Looking forward to the fix. :+1:

@apaszke I’m wondering whether there is any temporary trick to speed it up?

Removing the biases might help a bit. I’ve now looked at the cell once again, and I don’t think it should be slow. I think I’ll need to take a look at it in the profiler.
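Concretely, in the cell above that would mean constructing the two linear layers without bias terms (nn.Linear takes a bias flag; the sizes here are placeholders):

import torch.nn as nn

input_size, hidden_size, output_size = 256, 128, 128   # placeholder sizes

# Same two layers as in the cell above, but with the bias terms dropped.
i2h = nn.Linear(input_size + hidden_size, hidden_size, bias=False)
h2o = nn.Linear(input_size + hidden_size, output_size, bias=False)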

What input_size, hidden_size and output_size are you using?

The input is TxBxD. T is the number of time steps (usually 4096). B is the batch size, which I usually set to 2. D is the feature dimension (usually 256). The hidden state dimension is 128 and the output feature dimension is also 128.
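For reference, driving the cell from the first post with these sizes looks roughly like this (assuming the RNN class above is in scope; the data here is random):

import torch
from torch.autograd import Variable

T, B, D = 4096, 2, 256                    # time steps, batch size, input feature dim
hidden_size, output_size = 128, 128

rnn = RNN(input_size=D, hidden_size=hidden_size, output_size=output_size)
X = Variable(torch.randn(T, B, D))        # fake input of shape T x B x D
hiddens, outputs = rnn(X)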

Following this thread because I’m also interested in making modifications to the original LSTM, and I have benchmarked my results on reddit. For example:

  1. Layer normalization
  2. Multiplicative Integration
  3. Additive Residual Connections between Layers

Is there any way to maintain the speed of the cuDNN LSTM while adding these new features? I don’t know if autograd has been refactored yet. (A sketch of what modification 1 could look like follows below.)
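For concreteness, a rough sketch of modification 1 (layer normalization applied to the gate pre-activations), assuming a PyTorch version that provides nn.LayerNorm; the class name and the placement of the normalizations are illustrative, not a reference implementation:

import torch
import torch.nn as nn


class LayerNormLSTMCell(nn.Module):
    """Sketch: LSTM cell with layer normalization on the stacked gate pre-activations."""

    def __init__(self, input_size, hidden_size):
        super(LayerNormLSTMCell, self).__init__()
        self.hidden_size = hidden_size
        self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.ln_gates = nn.LayerNorm(4 * hidden_size)  # normalize the four gates jointly
        self.ln_cell = nn.LayerNorm(hidden_size)       # normalize the cell state before the output

    def forward(self, x, state0):
        h0, c0 = state0
        u = self.ln_gates(self.lin(torch.cat((x, h0), 1)))
        i, f, g, o = u.chunk(4, dim=1)
        c = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(self.ln_cell(c))
        return (h, c)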


cuDNN is very well optimized low-level CUDA C and/or assembly code that is specific to exactly the LSTM variant that they implement, and no framework can maintain that same speed while allowing flexibility to modify it in the ways you describe. But PyTorch will continue to work on optimization of use cases like this, and while right now the speed loss will probably be somewhere between 2x and 5x, it should get better over time.


Hello, I am also trying to implement an RNN from scratch, an LSTM to be specific. I am using two for loops, one over the sequence time steps and the other over the layers, and I am not sure how autograd will work on this. Can you confirm that it backprops through all time steps and layers? Also, instead of matrix multiplications in the LSTM I am using convolutions, and I cannot see past the conv function in loss.backward() as it has no previous_functions attribute. Please guide.

Thank you,
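For reference, the doubly nested loop structure being described looks roughly like the sketch below (the cell objects and names are placeholders). Autograd traces whatever operations actually execute, including inside plain Python loops, so the recorded graph covers every time step and every layer that ran:

import torch


def run_stacked_rnn(cells, X, init_states):
    """Sketch: outer loop over time steps, inner loop over layers.

    cells       -- list of cell modules, one per layer (LSTM-like: cell(x, (h, c)) -> (h, c))
    X           -- input of shape (time_steps, batch, features)
    init_states -- list of (h, c) tuples, one per layer
    """
    states = list(init_states)
    outputs = []
    for t in range(X.size(0)):            # loop over time steps
        inp = X[t]
        for l, cell in enumerate(cells):  # loop over layers
            h, c = cell(inp, states[l])
            states[l] = (h, c)
            inp = h                       # output of layer l feeds layer l + 1
        outputs.append(inp)
    return torch.stack(outputs, 0), states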

Any updates on improving speed for customized RNNs? I am also working with a non-traditional RNN, so I cannot use the predefined cuDNN cells. I converted some of my old Lua Torch code to PyTorch and it is 3 times slower.

To give an example, I implemented an LSTM cell from scratch (see below) in both PyTorch and Lua Torch (using nngraph and cunn), and I ran it forward and backward 1000 times with fake data. The computational times on a Titan X are:
pytorch: 4.6s
lua-torch: 1.4s

I really enjoy pytorch so I hope something can be done about it. Thanks!

import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMCell(nn.Module):

    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.hidden_size = hidden_size
        # One linear layer produces the four stacked gate pre-activations.
        self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state0):
        h0, c0 = state0
        x_and_h0 = torch.cat((x, h0), 1)
        u = self.lin(x_and_h0)
        # Slice the stacked pre-activations into input, forget, candidate and output gates.
        i = F.sigmoid(u[:, 0 * self.hidden_size: 1 * self.hidden_size])
        f = F.sigmoid(u[:, 1 * self.hidden_size: 2 * self.hidden_size])
        g = F.tanh(u[:, 2 * self.hidden_size: 3 * self.hidden_size])
        o = F.sigmoid(u[:, 3 * self.hidden_size: 4 * self.hidden_size])
        c = f * c0 + i * g
        h = o * F.tanh(c)
        return (h, c)

I got the computational time with the following parameters:
input_size=500
hidden_size=500
batch_size=20
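A minimal sketch of the timing loop described above, with fake data of the stated sizes (this is an approximation of the benchmark, not the original script):

import time
import torch
from torch.autograd import Variable

input_size, hidden_size, batch_size = 500, 500, 20

cell = LSTMCell(input_size, hidden_size).cuda()
x = Variable(torch.randn(batch_size, input_size).cuda())
h0 = Variable(torch.randn(batch_size, hidden_size).cuda())
c0 = Variable(torch.randn(batch_size, hidden_size).cuda())

torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    h, c = cell(x, (h0, c0))
    (h.sum() + c.sum()).backward()        # backward pass; gradients accumulate in cell.parameters()
torch.cuda.synchronize()
print('1000 forward+backward passes: %.2f s' % (time.time() - start))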

I’m not sure whether it would be effective in this case, but Numba or Cython may speed up the loop.

No, unfortunately it is not a problem with the loop. I ran the cell 1000 times and it took 4.6 s in total. Looking at the details of each run, each takes on average 0.0045 s, so the loop itself accounts for very little. Thanks though!

Profiling the cell, I find that only 20% of the computational time is spent on the matrix multiplication. On the other hand, 70% is spent on tanh, sigmoid, add, mul, cat and slice (roughly equally divided between them). Does this come from the kernel launch latency mentioned earlier in the conversation by @apaszke? If yes, are there any future plans/hope to improve on this? Thanks.
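For anyone who wants to reproduce such a breakdown, torch.autograd.profiler reports per-operator times; a minimal sketch (the exact output format depends on the PyTorch version):

import torch
from torch.autograd import Variable

cell = LSTMCell(500, 500).cuda()
x = Variable(torch.randn(20, 500).cuda())
h0 = Variable(torch.randn(20, 500).cuda())
c0 = Variable(torch.randn(20, 500).cuda())

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        h, c = cell(x, (h0, c0))
# Prints a per-operator table (addmm vs. sigmoid/tanh/mul/cat/slice, etc.)
print(prof)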

The difference between PyTorch and Lua Torch is probably due to autograd overhead (it is ~10us per operation in PyTorch). There are plans to reduce autograd overhead, and this should help bring the PyTorch time closer to Lua Torch (see https://github.com/pytorch/pytorch/issues/2518#issuecomment-327835296). As for speeding up pointwise operations, a fuser to do this is on the roadmap. In the meantime the best solution is probably to write your own custom kernels using CuPy, as in the pyinn project https://github.com/szagoruyko/pyinn/blob/master/pyinn/conv2d_depthwise.py. You’d also need to hand-code your backward pass in this case.
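As a rough illustration of that route (this is not the pyinn code): a CuPy elementwise kernel that fuses the pointwise part of the LSTM cell (sigmoid/tanh/mul/add) into a single launch. Only the forward pass is sketched; as noted above, the backward pass would have to be hand-written, and hooking this into autograd requires a custom Function:

import cupy as cp

# Fuse the LSTM pointwise math (gate nonlinearities + state update) into one
# kernel launch instead of separate sigmoid/tanh/mul/add kernels.
lstm_pointwise = cp.ElementwiseKernel(
    'T i_pre, T f_pre, T g_pre, T o_pre, T c0',
    'T h, T c',
    '''
    T i = 1 / (1 + exp(-i_pre));
    T f = 1 / (1 + exp(-f_pre));
    T g = tanh(g_pre);
    T o = 1 / (1 + exp(-o_pre));
    c = f * c0 + i * g;
    h = o * tanh(c);
    ''',
    'lstm_pointwise')

# Usage with CuPy arrays: u is the output of the linear layer, split into gates.
# i_pre, f_pre, g_pre, o_pre = cp.split(u, 4, axis=1)
# h, c = lstm_pointwise(i_pre, f_pre, g_pre, o_pre, c0)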


Thanks for the detailed answer! Things make much more sense.

Any updates at this moment?


Any update on this topic?