Hidden state of each sequence of mini-batch

Mainul · August 9, 2019, 9:56am

I am new to PyTorch and am trying to implement an LSTM character-level seq2seq model. Each sequence is the list of characters of a particular word, and several words form a minibatch, which also represents a sentence. As I understand it, each sequence (a list of character embeddings, in my case) has a final hidden state. So if there are two character sequences (two words), there should be two hidden states, each representing a word. I am not even considering variable-length sequences yet. I also don’t understand why variable sequence lengths should be a problem. Shouldn’t the LSTM loop for as long as there are elements in each particular sequence? The number of iterations should not be fixed, right? Here is the code I tried:

import torch
import torch.nn as nn

character_embedding = nn.Embedding(17, 5)
# LSTM with input embedding dimension 5 and expected hidden state dimension 3
lstm = nn.LSTM(5, 3)
# each vector is a word, and there are two words with the same number of characters
char_embeds = character_embedding(torch.tensor([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8]]))
# out will contain the hidden states for each character, and hidden should contain the final hidden state for each sequence
out, hidden = lstm(char_embeds)
print("char_embeds: ")
print(char_embeds)
print("hidden: ")
print(hidden[0])

Output:

char_embeds: 
tensor([[[ 1.0157, -0.2197,  1.6615, -1.2916, -0.6116],
         [ 0.5630, -0.9618,  0.7287, -0.5727,  1.6796],
         [ 0.9902, -0.5408,  0.9785, -1.1090,  1.1126],
         [ 0.7472,  0.0440,  1.0629, -0.7375,  0.0828],
         [ 0.6632, -0.4523,  0.5051,  2.6031,  0.3798]],

        [[ 0.7472,  0.0440,  1.0629, -0.7375,  0.0828],
         [ 0.6632, -0.4523,  0.5051,  2.6031,  0.3798],
         [-0.6522, -3.2626,  0.7967, -1.0322,  0.4667],
         [-0.5086,  0.5142, -0.7141, -1.5352,  0.4177],
         [-0.0582,  1.3398, -0.2829,  0.1392,  1.0709]]],
       grad_fn=<EmbeddingBackward>)
hidden: 
tensor([[[-0.2774,  0.0724, -0.4297],
         [-0.4580,  0.1563, -0.5811],
         [-0.5492, -0.2314,  0.3473],
         [-0.0772,  0.2474, -0.1026],
         [-0.1042,  0.4394, -0.3582]]], grad_fn=<StackBackward>)

Here, I would expect two hidden states, as there are two sequences. But I am getting 5 hidden states. Why is that? What am I missing?

The second question is: why can’t an LSTM handle variable-length sequences?

vdw · August 10, 2019, 10:31am

You get 5 hidden states since you have 2 sequences of length 5: [1,2,3,4,5] and [4,5,6,7,8]. 2 is here your batch size. You do not have a sequence of 2 words, but a minibatch of 2 sequences (representing words), each with 5 characters. The LSTM will loop over the 5 characters yielding 5 hidden states for each sequence.

Please note that there are no semantics attached when you say that a minibatch is a sentence. Sure, all the words in the minibatch may come from a single sentence, but for the LSTM it’s just a bag of words with no connection between them.

An LSTM can handle variable sequence lengths just fine as long as all sequences in the same batch have the same length. This is why padding might be needed. For example, you cannot have [1,2,3,4,5] and [4,5,6] in the same batch since they have different lengths, but you could pad the shorter one to, e.g., [4,5,6,0,0]. Also note that if your minibatches have sequences with different lengths, you need to re-initialize your hidden state w.r.t. the sequence length of the current batch (for an example, see the init_hidden method here).
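The padding idea above can be sketched as follows. This is only a minimal illustration, assuming index 0 is reserved for padding and reusing the embedding and hidden sizes from the question. It also uses pack_padded_sequence, which the thread does not mention, so that the final hidden state of the shorter word is taken at its last real character rather than after the padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

# Assumption: index 0 is reserved as the padding index.
embedding = nn.Embedding(17, 5, padding_idx=0)
lstm = nn.LSTM(5, 3, batch_first=True)

# Two words of different lengths (character indices).
words = [torch.tensor([1, 2, 3, 4, 5]), torch.tensor([4, 5, 6])]
lengths = torch.tensor([len(w) for w in words])

# Pad to a common length: shape (batch=2, max_len=5).
padded = pad_sequence(words, batch_first=True, padding_value=0)
embedded = embedding(padded)  # shape (2, 5, 5)

# Packing tells the LSTM where each sequence really ends.
packed = pack_padded_sequence(
    embedded, lengths, batch_first=True, enforce_sorted=False
)
_, (h_n, c_n) = lstm(packed)

print(padded)
print(h_n.shape)  # (num_layers, batch, hidden_size)
```

Without packing, the LSTM would still run, but the final hidden state of the padded word would be computed after two extra padding steps.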

Mainul · August 11, 2019, 12:38pm

Thanks, Chris, for your answer.
As far as I understand from the theory of RNNs/LSTMs, there is a hidden state for each element of a sequence, and at the end of the sequence there is a final hidden state. An LSTM also has another state besides the hidden state, called the cell state.
In PyTorch, the LSTM returns two things. The first is the hidden states for each element of the sequences; the variable out should hold this in my case. The second holds the final hidden state of each sequence along with the cell state. In my example, the hidden variable should hold these: hidden[0] should hold the final hidden states of the sequences and hidden[1] should hold the cell states.

So, if there are two sequences in a mini-batch, shouldn’t there be 2 final hidden states held by the variable hidden[0]?

vdw · August 13, 2019, 2:02am

You only need to have a look at the documentation:

Yeah, the output of an LSTM is output, (h_n, c_n)
The shape of output is (seq_len, batch, num_directions * hidden_size), where batch is the batch size, i.e., the number of sequences in your batch
The shape of h_n is (num_layers * num_directions, batch, hidden_size); again, batch being the batch size.

So if you want all the hidden states of the 2nd sequence, you could do output[:,1,:], or first swap the first two dimensions with output = output.transpose(0,1) and then take output[1].

Please note that you need to be a bit more careful in case you define your LSTM with bidirectional=True; see a previous post of mine (the image in the post may also help with understanding the output of an LSTM, although it ignores the batch dimension).
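As a quick check of these shapes, here is a small sketch with random input. The sizes are taken from the question (5 characters, 2 words, embedding size 5, hidden size 3), and the variable names are only illustrative. It uses transpose(0, 1) to swap the sequence and batch dimensions.

```python
import torch
import torch.nn as nn

seq_len, batch, emb_dim, hidden_size = 5, 2, 5, 3

lstm = nn.LSTM(emb_dim, hidden_size)      # default: batch_first=False
x = torch.randn(seq_len, batch, emb_dim)  # (seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # (seq_len, batch, num_directions * hidden_size)
print(h_n.shape)     # (num_layers * num_directions, batch, hidden_size)

# All hidden states of the 2nd sequence in the batch.
second_seq = output[:, 1, :]

# Same result after swapping the sequence and batch dimensions.
batch_major = output.transpose(0, 1)
print(torch.equal(second_seq, batch_major[1]))
```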

Mainul · August 13, 2019, 10:18am

Thank you very much Chris for your reply.
I think I have got the solution. But to be clear, please check whether I am right.
The problem was that I expected 2 final hidden states because the batch size is 2, and I assumed that the LSTM takes its input as (batch, sequence, embedding). I noticed that this is not the default: by default, the LSTM takes its input as (sequence, batch, embedding). So I have to set batch_first=True.
This is the output I got,
hidden[0]:

tensor([[[-0.0538,  0.1163,  0.0256],
         [ 0.0165,  0.1825, -0.0107]]], grad_fn=<StackBackward>)

Which I was expecting.
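To make the difference concrete, here is a small sketch that feeds the same (2, 5, 5) tensor from the original post to two LSTMs, one with the default layout and one with batch_first=True. The two LSTMs have independent random weights, so only the shapes are comparable, not the values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

character_embedding = nn.Embedding(17, 5)
char_ids = torch.tensor([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8]])
char_embeds = character_embedding(char_ids)  # shape (2, 5, 5)

# Default layout: the input is read as (seq_len=2, batch=5, input_size=5),
# so there are 5 "sequences" of length 2 and 5 final hidden states.
lstm_default = nn.LSTM(5, 3)
_, (h_default, _) = lstm_default(char_embeds)
print(h_default.shape)

# batch_first=True: the input is read as (batch=2, seq_len=5, input_size=5),
# so there are 2 sequences of length 5 and 2 final hidden states.
lstm_batch_first = nn.LSTM(5, 3, batch_first=True)
_, (h_batch_first, _) = lstm_batch_first(char_embeds)
print(h_batch_first.shape)
```

Note that batch_first only affects the input and the output tensor; h_n and c_n keep the shape (num_layers * num_directions, batch, hidden_size) either way.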

vdw · August 13, 2019, 11:34pm

Yes, you always need to be careful that you provide the correct shape of the input. LSTM is fine with both (2,5,100) and (5,2,100) since the difference is not on a technical level but depends on your context.

As a side note, this is why the view() command is dangerous. Just because the shape is correct after applying view() (e.g., a subsequent layer doesn’t complain about the shape) does not mean that the reshaping was done correctly from a semantic point of view.
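A tiny sketch of that pitfall, using a made-up 2 x 5 tensor of integers so the element order is easy to follow: view() and transpose() both produce shape (5, 2), but only transpose() keeps each sequence intact.

```python
import torch

# Two sequences of length 5: row 0 is [0..4], row 1 is [5..9].
x = torch.arange(10).reshape(2, 5)

viewed = x.view(5, 2)            # reinterprets memory row by row
transposed = x.transpose(0, 1)   # actually swaps the two axes

print(viewed.shape == transposed.shape)  # same shape, different content
print(viewed[:, 0])       # mixes elements from both sequences
print(transposed[:, 0])   # the first sequence, as intended
```

A subsequent layer would accept either tensor, which is why the mistake is easy to miss.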

Mainul · August 14, 2019, 2:47pm

Thank you very much for the heads up about the view() command.

The default shape of the LSTM input and output, (sequence length, batch size, feature), is counterintuitive to me as a beginner. I am still wondering why this shape was chosen as the default.
Also, I think they should be clearer about this from the beginning of their tutorials.

vdw · August 15, 2019, 2:49am

I would assume that (seq_len, batch_size, hidden_dim) is just a design decision to reflect the focus on the time steps inherent to RNNs such as LSTMs or GRUs. For example, you can do output[-1] to get the output of the last step.
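As a small illustration of that indexing, the sketch below assumes a single-layer, unidirectional LSTM with the default layout. In that case output[-1] should match the final hidden state h_n[-1]; with bidirectional=True this simple correspondence no longer holds.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(5, 3)        # single layer, unidirectional, batch_first=False
x = torch.randn(5, 2, 5)    # (seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)

last_step = output[-1]      # output of the last time step, shape (batch, hidden_size)
print(last_step.shape)
print(torch.allclose(last_step, h_n[-1]))
```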

Usually, in my forward() method, I keep note of the shape after each processing step (in case of view() I even note why the reshaping is needed). This helps me a lot :). Checking the required input shapes and the resulting output shapes is probably what I look up most frequently in the PyTorch docs.
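A minimal sketch of what such shape annotations could look like. The module name CharWordEncoder and its default sizes are hypothetical, chosen to match the numbers used earlier in the thread; this is not code from the posts above.

```python
import torch
import torch.nn as nn


class CharWordEncoder(nn.Module):
    """Hypothetical example: encodes each word (a character sequence) as one vector."""

    def __init__(self, vocab_size=17, emb_dim=5, hidden_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_size, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len)
        embeds = self.embedding(char_ids)
        # embeds: (batch, seq_len, emb_dim)
        output, (h_n, c_n) = self.lstm(embeds)
        # output: (batch, seq_len, hidden_size)
        # h_n:    (num_layers, batch, hidden_size)
        word_vectors = h_n[-1]
        # word_vectors: (batch, hidden_size), one vector per word
        return word_vectors


encoder = CharWordEncoder()
word_vectors = encoder(torch.tensor([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8]]))
print(word_vectors.shape)
```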