Clarify difference between output and h_n in RNN?

Hey,

I am trying to design an RNN-based model that processes time series data. Fundamentally, I don’t understand the exact roles of output and h_n.

In my understanding, output is, for instance, a tensor of shape (batch_size, L, Hout), where L is the sequence length (time steps 0,…,t, sort of) and Hout is the number of output features (i.e. the number of entries in the hidden state vector). Therefore, output is like a matrix containing [h_0|h_1|…|h_L]. Thus, if we wanted to have just the last / most recent hidden state, we would simply slice the last column h_L of that matrix?
And then there is h_n, which, as I understand it, is exactly this last column of the matrix? Is it returned just for convenience? Or do I misunderstand the whole thing?
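
To make my question concrete, here is roughly what I mean in code — just a minimal sketch with made-up sizes, assuming a single-layer nn.RNN with batch_first=True:

```python
import torch
import torch.nn as nn

# Made-up sizes, just for illustration
batch_size, L, n_features, H_out = 4, 10, 3, 16

rnn = nn.RNN(input_size=n_features, hidden_size=H_out, batch_first=True)
x = torch.randn(batch_size, L, n_features)

output, h_n = rnn(x)
print(output.shape)  # (batch_size, L, H_out): one hidden state per time step
print(h_n.shape)     # (num_layers, batch_size, H_out): only the final time step

# The most recent hidden state is the last "column" of output:
print(torch.allclose(output[:, -1, :], h_n[-1]))  # True for this single-layer, unidirectional RNN
```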

Furthermore, if we wanted the output of the RNN cell to have a different dimension than the hidden state, would we put, for instance, a fully connected layer of shape (Hout, O) “on top” of the last state of the RNN, where O is the intended output size? And if we wanted to return an entire sequence, would we apply that learned hidden-to-output function (weight matrix of shape (Hout, O)) to as many past hidden states as the desired prediction sequence length?
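
For this second part, I picture something like the sketch below (sizes again invented, and the hidden-to-output map is just a hypothetical nn.Linear):

```python
import torch
import torch.nn as nn

batch_size, L, n_features, H_out, O = 4, 10, 3, 16, 5  # made-up sizes

rnn = nn.RNN(input_size=n_features, hidden_size=H_out, batch_first=True)
head = nn.Linear(H_out, O)  # hypothetical hidden-to-output map of shape (H_out, O)

x = torch.randn(batch_size, L, n_features)
output, h_n = rnn(x)

y_last = head(output[:, -1, :])  # one prediction from the last hidden state: (batch_size, O)
y_seq = head(output)             # the same map applied to every hidden state: (batch_size, L, O)
```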

Hope these formulations are clear, I am still just finding my way around the theory.

Best, JZ

Hello, @jayz!

I’m not a specialist in RNNs either, and I also got confused when learning them.
But I think I can help you in this case :slight_smile:

Answering your questions:

Q1: if we wanted to have just the last / most recent hidden state, we would simply slice the last column h_L of that matrix?
Q2: there is h_n, which, as I understand, is exactly this last column of the matrix?
A: For both questions, the answer is yes. For the classic RNN and for GRUs, that’s exactly right.

Q3: do I misunderstand the whole thing?
A: No, you understood it correctly, but it is correct for the classic RNN and for the GRU. For them, the output at each step is also the hidden state.
What you may not have learned yet, or are not considering, is that there is another type of RNN called the LSTM (Long Short-Term Memory). This architecture actually keeps two separate states: the hidden state (which is also what it outputs at each step) and an additional cell state.
If I’m not mistaken, the classic RNN came first, then the LSTM, and the GRU came later as a simplification of the LSTM.

Q4: Is this returned for convenience?
A: Yes, it is returned for convenience, considering the different types of recurrent layers (classic RNN, LSTM, GRU).
Now that you know the LSTM carries this extra state, the point is that PyTorch simply returns h_n in all of its RNN implementations so that you can freely exchange the type of RNN without needing to change your whole code.
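
Here is a small sketch of what I mean (sizes are made up), showing how similar the three layers look in PyTorch:

```python
import torch
import torch.nn as nn

batch_size, L, n_features, H_out = 4, 10, 3, 16  # made-up sizes
x = torch.randn(batch_size, L, n_features)

rnn = nn.RNN(n_features, H_out, batch_first=True)
gru = nn.GRU(n_features, H_out, batch_first=True)
lstm = nn.LSTM(n_features, H_out, batch_first=True)

out, h_n = rnn(x)          # h_n: (num_layers, batch_size, H_out)
out, h_n = gru(x)          # same shapes as for nn.RNN
out, (h_n, c_n) = lstm(x)  # the LSTM additionally returns its cell state c_n

# In every case, the last time step of `output` is the final hidden state of the last layer:
print(torch.allclose(out[:, -1, :], h_n[-1]))  # True
```

The only difference in the call is that the LSTM hands back a tuple (h_n, c_n) instead of a single tensor.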

Q5: If we wanted the output of the RNN cell to have a different dimension than the hidden state, would we put a fully connected layer of shape (Hout, O) “on top” of the last state of the RNN? And for an entire sequence, would we apply that hidden-to-output map to as many past hidden states as the desired prediction sequence length?
A: In the classic RNN, it’s not possible to use different dimensions for output and hidden state, exactly because they are the same. To get an output of a different size, you have to add something on top of the recurrent layer, for instance a fully connected layer applied to the hidden state.

Finally, here is an image with the details of the classic RNN, LSTM and GRU:
[image: RNN_LSTM_GRU]

Hope that it helps you! :blush:

Best regards,
Rafael Macedo.

Hey Rafael,

first of all, thanks for the very complete response, I appreciate it.
A few comments:

“PyTorch simply returns h_n in all of its RNN implementations so that you can freely exchange the type of RNN without needing to change your whole code.”
>> makes total sense to me now, in terms of code compatibility. Hadn’t thought of this reason, thanks for clarifying.

“In the classic RNN, it’s not possible to use different dimensions for output and hidden state, exactly because they are the same.”
>> Ok, so let’s think of a classifier which outputs a softmax activation of size O at the end, in order to classify an input into one of O classes. Would one then typically just set Hout = O, in order to directly receive a vector of the right size, or would one insert an intermediate fully connected layer that transforms Hout to O (via a weight matrix of shape Hout x O)? Maybe this is also an experimental design choice and there is no typical solution? The reason why I am asking is that, similarly, in CNNs we also typically add a fully connected layer on top of the final conv layer in order to transform the output of the conv layer into a vector of the right size (while we could instead also just use another conv layer to do that).
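
In code, I imagine the choice between these two options (just a sketch with invented sizes):

```python
import torch
import torch.nn as nn

batch_size, L, n_features, H_out, O = 4, 10, 3, 16, 5  # invented sizes, O = number of classes
x = torch.randn(batch_size, L, n_features)

# Option A: make the hidden state itself have size O
rnn_a = nn.RNN(n_features, hidden_size=O, batch_first=True)
_, h_n = rnn_a(x)
logits_a = h_n[-1]        # (batch_size, O), goes straight into softmax / CrossEntropyLoss

# Option B: keep Hout free and add a classification head on top
rnn_b = nn.RNN(n_features, hidden_size=H_out, batch_first=True)
head = nn.Linear(H_out, O)
_, h_n = rnn_b(x)
logits_b = head(h_n[-1])  # (batch_size, O)
```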

Concerning the differentiation between hidden state and output, I probably got a bit confused by this depiction (Fig. 10.3 from the excellent book Deep Learning by Goodfellow et al.):

with the corresponding equations:

a^(t) = b + W h^(t-1) + U x^(t)
h^(t) = tanh(a^(t))
o^(t) = c + V h^(t)
ŷ^(t) = softmax(o^(t))

As you can see, the output o is constructed from h via a separate weight matrix V here. I guess this is for generality, to show that the output can be transformed to any size. But you are saying that, in principle, if we choose the right size for the hidden state, we could simply skip the transformation by V (i.e. remove the “o”-nodes), right?

Thanks,
Best, JZ

Hi again, JZ!

I’m happy to help :slight_smile:
But as I mentioned before, I’m not an RNN specialist, which means my knowledge in this Deep Learning sub-area is limited. So I apologize in advance if there is any mistake.

Now, just answering your last question: with the additional info you gave me, I was able to understand what you were talking about.

Actually, you can do that. The book describes it as a fixed part of the RNN, but as far as I know, the most used DL frameworks (e.g. PyTorch and TensorFlow) don’t implement this transformation, because this weight matrix is simply a Dense layer.

So, to achieve the same architecture proposed in the book, you can add an extra Dense layer right after the RNN layer.
That was a good point you noticed, and my tip for you is: whenever you see a linear transformation (weight matrix + bias), you can implement it with a Dense layer.
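
As a small sketch of this (the class name and sizes are made up), the book’s o^(t) = c + V h^(t) becomes an nn.Linear right after the recurrent layer:

```python
import torch
import torch.nn as nn

# A sketch of the book's Fig. 10.3 in PyTorch: the recurrent layer produces h,
# and an nn.Linear plays the role of the weight matrix V plus the bias c.
class BookStyleRNN(nn.Module):  # hypothetical name, just for illustration
    def __init__(self, n_features, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden_size, batch_first=True)
        self.V = nn.Linear(hidden_size, output_size)  # o^(t) = c + V h^(t)

    def forward(self, x):
        h_all, _ = self.rnn(x)  # hidden states for every time step
        return self.V(h_all)    # output at every time step; softmax / loss is applied outside

model = BookStyleRNN(n_features=3, hidden_size=16, output_size=5)  # made-up sizes
y = model(torch.randn(4, 10, 3))  # (batch_size, time, output_size)
```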

You’re very attentive to details, congratulations!
Keep studying, keep learning! :nerd_face:

Hello Rafael,

thanks for the reply!

Actually, you can do that. The book describes it as a fixed part of the RNN, but as far as I know, the most used DL frameworks (e.g. PyTorch and TensorFlow) don’t implement this transformation, because this weight matrix is simply a Dense layer.

Okay, thank you, this clarifies it for me!

Best, JZ

Based on the literature, I assume h_n and c_n might be types of recurrent states, but it’s frustrating to have them abbreviated in the docs.