I was looking at the LSTM and was wondering why hidden = (h_n, c_n) has shape
(num_layers * num_directions, batch, hidden_size).
In particular, why is it not:
(num_layers * num_directions, seq_len, batch, hidden_size)?
Oh, I realized that only the first time step of each RNN block needs an initial hidden state, so seq_len never enters the shape. How many initial hidden states do we need, then? One for each sequence in the batch, one for each direction, and one for each layer, which gives num_layers * num_directions * batch states, each of dimension hidden_size. So we get:
(num_layers * num_directions, batch, hidden_size)
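To sanity-check this, here is a minimal sketch (all sizes are made up just for illustration) that runs a bidirectional 2-layer LSTM and prints the shapes:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to make the shapes easy to read.
num_layers, num_directions = 2, 2   # 2-layer, bidirectional LSTM
seq_len, batch, input_size, hidden_size = 7, 3, 10, 5

lstm = nn.LSTM(input_size, hidden_size,
               num_layers=num_layers, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

# output keeps every time step, so seq_len appears in its shape:
print(output.shape)  # torch.Size([7, 3, 10]) -> (seq_len, batch, num_directions * hidden_size)

# h_n and c_n only hold the state at the last time step for each
# layer/direction, so seq_len drops out of their shape:
print(h_n.shape)     # torch.Size([4, 3, 5]) -> (num_layers * num_directions, batch, hidden_size)
print(c_n.shape)     # torch.Size([4, 3, 5]) -> same shape as h_n
```

The same (num_layers * num_directions, batch, hidden_size) shape applies if you pass in your own initial states (h_0, c_0), since one initial state is needed per layer, direction, and sequence in the batch.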