What is num_layers in RNN module?

Hi, I am not sure about num_layers in the RNN module. To clarify, could you check whether my understanding is right? I uploaded an image for the case num_layers==2. In my understanding, num_layers is similar to a CNN’s out_channels: it is just an RNN layer with different filters (so we can train different weight variables for outputting h). Right?

I am probably right…

import torch
import torch.nn as nn
from torch.autograd import Variable  # since PyTorch 0.4, plain tensors work too

class TestLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(TestLSTM, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers, batch_first=False)
    def forward(self, x, h, c):
        # nn.LSTM returns a tuple: (output, (h_n, c_n))
        out = self.rnn(x, (h, c))
        return out

bs = 10
seq_len = 7
input_size = 28
hidden_size = 50
num_layers = 2

test_lstm = TestLSTM(input_size, hidden_size, num_layers)
print(test_lstm)

# batch_first=False, so the input is (seq_len, batch, input_size)
input = Variable(torch.randn(seq_len, bs, input_size))
# initial states are (num_layers, batch, hidden_size)
h0 = Variable(torch.randn(num_layers, bs, hidden_size))
c0 = Variable(torch.randn(num_layers, bs, hidden_size))
output, h = test_lstm(input, h0, c0)  # h is the tuple (h_n, c_n)
print('output', output.size())
print('h and c', h[0].size(), h[1].size())

TestLSTM (
  (rnn): LSTM(28, 50, num_layers=2)
)
output torch.Size([7, 10, 50])
h and c torch.Size([2, 10, 50]) torch.Size([2, 10, 50])

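No, your understanding is wrong. num_layers in an RNN just stacks RNNs on top of each other: the output sequence of one layer is the input sequence of the next. So you get a final hidden state from each layer (h_n has num_layers as its first dimension), while output contains the hidden states of the topmost layer only, one per time step.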
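To make the reply above concrete, here is a minimal sketch that checks the shapes and the relation between output and the last layer's hidden state. It reuses the sizes from the question; the variable names are only illustrative, and the initial states are left at their default of zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, bs, input_size, hidden_size, num_layers = 7, 10, 28, 50, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers)  # batch_first=False

x = torch.randn(seq_len, bs, input_size)
output, (h_n, c_n) = lstm(x)  # initial states default to zeros

# output: hidden states of the TOP layer at every time step
output_shape = tuple(output.shape)  # (seq_len, bs, hidden_size)
# h_n: final hidden state of EVERY layer
h_n_shape = tuple(h_n.shape)        # (num_layers, bs, hidden_size)

# For a unidirectional LSTM, the last time step of output is the
# final hidden state of the topmost layer.
top_matches = torch.allclose(output[-1], h_n[-1])

print(output_shape, h_n_shape, top_matches)
```

So num_layers does not change the size of output; it only adds rows to h_n and c_n, one per stacked layer.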


I found a nice image. Does this mean num_layers==2? And we can get the last hidden state, right?


Yes!! That is correct.


Thank you for your help!

I have two questions:

  1. Consider:
    self.lstm1 = nn.LSTM(input_dim, hidden_dim, num_layers=1)
    self.lstm2 = nn.LSTM(input_dim, hidden_dim, num_layers=2)

    Why are the weights the same? Are they reused? lstm1.weight_ih_l0.size() == lstm2.weight_ih_l0.size()

  2. self.lstm1a = nn.LSTM(input_dim, hidden_dim, num_layers=1)
    self.lstm1b = nn.LSTM(hidden_dim, hidden_dim, num_layers=1)
    self.lstm2 = nn.LSTM(input_dim, hidden_dim, num_layers=2)

    y2 = self.lstm2(x, …)
    y1 = self.lstm1b(self.lstm1a(x, …),…)

    Are y1 and y2 the same thing?

  1. They are definitely not the same values:
>>> lstm1 = nn.LSTM(input_dim, hidden_dim, num_layers=1)
>>> lstm2 = nn.LSTM(input_dim, hidden_dim, num_layers=2)
>>> lstm1.weight_ih_l0
Parameter containing:
-0.3027 -0.2689 -0.3551
 0.5509  0.1728  0.0360
-0.1964  0.1770  0.2209
-0.4915  0.3696  0.5712
 0.2401  0.0593 -0.4117
 0.4066  0.3684  0.3482
 0.2870 -0.0531  0.1953
 0.0928 -0.4165  0.5613
-0.4697  0.4112  0.1346
 0.3438 -0.1885  0.5242
 0.3756  0.2288  0.2949
-0.1401  0.0173 -0.0247
[torch.FloatTensor of size 12x3]

>>> lstm2.weight_ih_l0
Parameter containing:
-0.3672 -0.0299  0.1597
 0.0828 -0.2755  0.4451
 0.1861  0.1213 -0.5596
-0.2776 -0.4791 -0.2322
-0.5063  0.0437  0.1145
-0.2652 -0.0932  0.0865
-0.3323  0.4274 -0.3038
-0.1449 -0.1430  0.5393
 0.5589  0.1293 -0.5174
-0.4502  0.5351  0.2430
-0.5448 -0.4007 -0.2560
 0.5424 -0.1821 -0.0779
[torch.FloatTensor of size 12x3]
  2. No, they are computed by different LSTMs with different parameters, so they are different.
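Both answers can be checked with a short sketch. The sizes below are made up for illustration (input_dim=5 so that the layer-0 and layer-1 shapes differ), dropout is left at its default of 0, and the weight copying at the end is my own addition rather than something from the posts above. It shows that the first-layer weights have equal sizes only because both modules see the same input_dim and hidden_dim, that independently initialised modules give different outputs, and that the two constructions do agree once the stacked LSTM's per-layer weights are copied into the two single-layer LSTMs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim = 5, 3  # hypothetical sizes
seq_len, bs = 4, 2

lstm1a = nn.LSTM(input_dim, hidden_dim, num_layers=1)
lstm1b = nn.LSTM(hidden_dim, hidden_dim, num_layers=1)
lstm2 = nn.LSTM(input_dim, hidden_dim, num_layers=2)

# Layer 0 maps input_dim -> 4*hidden_dim gates; layer 1 maps hidden_dim -> 4*hidden_dim.
shape_l0 = tuple(lstm2.weight_ih_l0.shape)  # (12, 5)
shape_l1 = tuple(lstm2.weight_ih_l1.shape)  # (12, 3)
same_size = lstm1a.weight_ih_l0.size() == lstm2.weight_ih_l0.size()

x = torch.randn(seq_len, bs, input_dim)

with torch.no_grad():
    # Independently initialised parameters: outputs differ.
    y2, _ = lstm2(x)
    y1, _ = lstm1b(lstm1a(x)[0])
    differ_before = not torch.allclose(y1, y2)

    # Copy layer 0 of lstm2 into lstm1a and layer 1 into lstm1b.
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(lstm1a, name + "_l0").copy_(getattr(lstm2, name + "_l0"))
        getattr(lstm1b, name + "_l0").copy_(getattr(lstm2, name + "_l1"))

    y1, _ = lstm1b(lstm1a(x)[0])
    same_after = torch.allclose(y1, y2, atol=1e-6)

print(shape_l0, shape_l1, same_size, differ_before, same_after)
```

In other words, the values differ only because each module draws its own random initial parameters; nothing is shared or reused between separate nn.LSTM instances.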

@SimonW @smth
Hello guys, I’d like to ask you one thing about this parameter (num_layers in the RNN module) and how it relates to the stable LSTM documentation.
Looking at the picture posted above, I’d say that the hidden state at time t of the first hidden layer receives as input the hidden state at time (t-1) of the same layer. Similarly, the hidden state at time t of the second layer receives as input the hidden state at time (t-1) of the second layer.
Yet, in the nn.LSTM doc (https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM) there is:
“h(t−1) is the hidden state of the previous layer at time t-1”
Considering that the gates receive h(t-1) as input, does this mean that the l-th layer should look at the (l-1)-th layer? Or am I reading it wrong?

That’s a bug in the documentation; your interpretation of the picture is right.
The t-dimension stays in the same layer. The connection between the layers is that the output of the (l-1)-th layer is the input of the l-th layer, possibly multiplied by a dropout mask, i.e. i^(l)(t) = h^(l-1)(t) delta^(l-1)(t), where delta^(l-1)(t) is the dropout mask.
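As a rough sketch of that wiring, the loop below rebuilds a 2-layer nn.LSTM from two nn.LSTMCell modules that share its weights. The helper stacked_forward and all sizes are hypothetical, and F.dropout stands in for the mask delta (with p_drop=0 it is the identity, which is what the comparison uses). The recurrence over t uses h[l] and c[l] of the same layer, while the input of layer l at time t is the hidden state of layer l-1 at the same time t.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

input_size, hidden_size, seq_len, bs = 4, 3, 6, 2  # hypothetical sizes
lstm = nn.LSTM(input_size, hidden_size, num_layers=2)  # dropout=0 by default

# One cell per layer, sharing the stacked LSTM's parameters.
cells = [nn.LSTMCell(input_size, hidden_size),
         nn.LSTMCell(hidden_size, hidden_size)]
with torch.no_grad():
    for l, cell in enumerate(cells):
        for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
            getattr(cell, name).copy_(getattr(lstm, f"{name}_l{l}"))

def stacked_forward(x, p_drop=0.0, training=False):
    """Hand-rolled stacked LSTM: x is (seq_len, batch, input_size)."""
    batch = x.size(1)
    h = [x.new_zeros(batch, hidden_size) for _ in cells]
    c = [x.new_zeros(batch, hidden_size) for _ in cells]
    top = []
    for t in range(x.size(0)):
        inp = x[t]  # input of layer 0 at time t
        for l, cell in enumerate(cells):
            # time recurrence: layer l sees its OWN state from t-1
            h[l], c[l] = cell(inp, (h[l], c[l]))
            # layer connection: layer l+1 sees h of layer l at the SAME t,
            # passed through dropout (unused after the top layer)
            inp = F.dropout(h[l], p_drop, training)
        top.append(h[-1])  # output comes from the topmost layer only
    return torch.stack(top)

x = torch.randn(seq_len, bs, input_size)
with torch.no_grad():
    manual = stacked_forward(x)
    reference, _ = lstm(x)
matches = torch.allclose(manual, reference, atol=1e-5)
print(matches)
```

If the l-th layer really looked at the (l-1)-th layer's state from time t-1, this loop would not reproduce nn.LSTM's output.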

Best regards


Hi @FAlex,

thanks for pointing out the potential for improvement in the documentation!
I’ve put this into a PR on GitHub, so hopefully PyTorch 1.0 ships with clearer documentation.

Best regards



That’s awesome, thanks!

If num_layers==3, will the output characters on the top be the same?