I need some clarity on how to correctly prepare inputs for different components of nn, mainly nn.Embedding, nn.LSTM and nn.Linear, for the case of batch training. I want to use these components to create an encoder-decoder network for a seq2seq model. There are lots of examples online, but they confuse me.

Consider an example where I have,

1) Embedding followed by 2) LSTM followed by 3) Linear unit:

1. nn.Embedding
Input: batch_size x seq_length
Output: batch_size x seq_length x embedding_dimension

2. nn.LSTM
Input: seq_length x batch_size x input_size (embedding_dimension in this case)
Output: seq_length x batch_size x hidden_size
last_hidden_state: batch_size x hidden_size
last_cell_state: batch_size x hidden_size

For the output of Embedding to be the input for LSTM, I need to transpose the first and second axes (batch_size and seq_length).
I see many examples online which do something like x = embeds.view(len(sentence), self.batch_size, -1), which confuses me. How does this view ensure that elements of the same batch remain in the same batch? What happens when len(sentence) and self.batch_size are the same?

3. nn.Linear
Input: batch_size x input_size (hidden_size of LSTM in this case or ??)
Output: batch_size x output_size

If I want to consume only the last_hidden_state of the LSTM, then I can use it as is for nn.Linear.
But if I want to make use of Output (which contains all intermediate hidden states as well), then I need to change nn.Linear’s input size to seq_length x hidden_size. To use Output as input to the Linear module, I need to transpose the first and second axes of Output and then flatten it with Output_transposed.view(batch_size, -1).

Is my understanding here correct? If yes, how do I carry out these transpose operations on tensors (tensor.transpose(0, 1))?
I know it’s a newbie question, but instead of doing something which ‘runs’, I want to understand how to do it correctly.
Please help!!

I think that if you give an nn.Embedding input of shape (seq_len, batch_size), then it will happily produce output of shape (seq_len, batch_size, embedding_size). Embedding expects 2d input and replaces every element with a vector. Thus the order of the dimensions of the input has no importance.
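To make that concrete, here is a minimal sketch. The sizes and variable names are made up for illustration; only nn.Embedding itself comes from the discussion above.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, embedding_size = 10, 4
seq_len, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embedding_size)

# Token indices laid out as (seq_len, batch_size).
tokens = torch.randint(0, vocab_size, (seq_len, batch_size))
embedded = embedding(tokens)
print(embedded.shape)  # torch.Size([5, 3, 4])

# Feeding the transposed indices gives the same vectors, batch-first.
embedded_bf = embedding(tokens.t().contiguous())
print(embedded_bf.shape)  # torch.Size([3, 5, 4])
print(torch.equal(embedded_bf, embedded.transpose(0, 1)))  # True
```

So it should be enough to lay out the index tensor in whichever order the next layer expects.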

Your LSTM input and output sizes look mostly good to me. This post helped me get my head around them: "Understanding output of lstm".
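A quick way to check the sizes is to print them. This is only a sketch with made-up dimensions; note that the hidden and cell states carry a leading num_layers * num_directions dimension in addition to batch_size and hidden_size.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
seq_len, batch_size, input_size, hidden_size, num_layers = 5, 3, 4, 6, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers)
x = torch.randn(seq_len, batch_size, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([5, 3, 6])  (seq_len, batch_size, hidden_size)
print(h_n.shape)     # torch.Size([2, 3, 6])  (num_layers, batch_size, hidden_size)
print(c_n.shape)     # torch.Size([2, 3, 6])
```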

You can initialise nn.LSTM with batch_first=True if you need to switch the seq_len and batch_size dimensions of the input and output.
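A small sketch of what batch_first=True changes, again with made-up sizes. As far as the input and output go, the batch dimension comes first; the hidden and cell states keep their usual layout.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
seq_len, batch_size, input_size, hidden_size = 5, 3, 4, 6

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)  # batch dimension first

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([3, 5, 6])  (batch_size, seq_len, hidden_size)
print(h_n.shape)     # torch.Size([1, 3, 6])  states are not affected by batch_first
```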

If the input to nn.Embedding is appropriately shaped then I can’t see why a .view operation before the LSTM should be necessary.
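On the question of whether .view keeps batch elements together: it does not in general, because it only reinterprets the underlying memory in row-major order. A toy sketch (the values are made up so that each batch element is easy to recognise):

```python
import torch

batch_size, seq_len = 2, 3

# Row b holds the tokens of batch element b.
x = torch.tensor([[0, 1, 2],
                  [10, 11, 12]])  # (batch_size, seq_len)

# view just re-chunks the flat data, so batch elements get mixed.
viewed = x.view(seq_len, batch_size)
print(viewed.tolist())  # [[0, 1], [2, 10], [11, 12]]

# transpose actually swaps the axes; column b is still batch element b.
transposed = x.transpose(0, 1)
print(transposed.tolist())  # [[0, 10], [1, 11], [2, 12]]
```

When seq_len and batch_size happen to be equal, the view changes nothing at all, so the axes are silently left unswapped.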

For consuming last hidden state only…

lstm_output, (last_hidden_state, last_cell_state) = self.lstm(embedded)
linear_input = last_hidden_state[-1] # get hidden state for last layer
# or equivalently (for a unidirectional LSTM with batch_first=False)
linear_input = lstm_output[-1] # get last step of output
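Here is a self-contained check of that equivalence, as a sketch with made-up sizes. It also shows that the two expressions stop matching for a bidirectional LSTM, so the shortcut should only be relied on in the unidirectional case.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: input_size=4, hidden_size=6, two stacked layers.
lstm = nn.LSTM(input_size=4, hidden_size=6, num_layers=2)
embedded = torch.randn(5, 3, 4)  # (seq_len, batch_size, input_size)

lstm_output, (last_hidden_state, last_cell_state) = lstm(embedded)

# Last layer's final hidden state vs. last time step of the output.
same = torch.allclose(last_hidden_state[-1], lstm_output[-1])
print(same)  # True

# With bidirectional=True the two are not interchangeable:
# the output concatenates both directions, the state holds one direction.
bilstm = nn.LSTM(input_size=4, hidden_size=6, bidirectional=True)
out_bi, (h_bi, _) = bilstm(embedded)
print(out_bi[-1].shape)  # torch.Size([3, 12])
print(h_bi[-1].shape)    # torch.Size([3, 6])
```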

For consuming the hidden states of the whole sequence

Note that in this case the sequence length must always be the same.
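One way this could look is sketched below. The names and sizes are made up and this is not necessarily the exact code intended above. It also shows why the sequence length has to be fixed: the Linear layer's input size is seq_len * hidden_size.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
seq_len, batch_size, input_size, hidden_size, output_size = 5, 3, 4, 6, 2

lstm = nn.LSTM(input_size, hidden_size)
# in_features depends on seq_len, so every batch must use the same length.
linear = nn.Linear(seq_len * hidden_size, output_size)

embedded = torch.randn(seq_len, batch_size, input_size)
lstm_output, _ = lstm(embedded)            # (seq_len, batch_size, hidden_size)

batch_first = lstm_output.transpose(0, 1)  # (batch_size, seq_len, hidden_size)
# transpose returns a non-contiguous tensor, so use reshape
# (or .contiguous().view(...)) to flatten each batch element.
linear_input = batch_first.reshape(batch_size, -1)

result = linear(linear_input)
print(linear_input.shape)  # torch.Size([3, 30])
print(result.shape)        # torch.Size([3, 2])
```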

Most tensor ops work on Variables too, which is necessary if you want to backpropagate.
If you operate on tensors directly then those operations are not stored in the computation graph and cannot be backpropagated.

@jpeg729 @Gaurav_Koradiya
Can you guys please follow up on this? I am new to PyTorch and I am confused. So, to understand what is happening, I wrote this small toy code.

I did not have to take a view of the output before applying the Linear layer. Is this because I am using a more recent version of PyTorch (1.4) than when the discussion took place? Or am I missing something here?
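For reference, here is a small illustration of that behaviour. It is not the toy code mentioned above, just a sketch with made-up sizes: nn.Linear acts on the last dimension of its input and leaves the leading dimensions alone, so a 3D LSTM output can be passed in directly when the layer's input size is hidden_size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: input_size=4, hidden_size=6, output_size=2.
lstm = nn.LSTM(input_size=4, hidden_size=6)
linear = nn.Linear(6, 2)

x = torch.randn(5, 3, 4)  # (seq_len, batch_size, input_size)
output, _ = lstm(x)       # (seq_len, batch_size, hidden_size)

# No view needed: Linear is applied to every (time step, batch element) pair.
scores = linear(output)
print(scores.shape)  # torch.Size([5, 3, 2])

# Same result as applying the layer one time step at a time.
per_step = torch.stack([linear(output[t]) for t in range(output.size(0))])
print(torch.allclose(scores, per_step, atol=1e-6))  # True
```

This gives one prediction per time step, which is a different setup from flattening the whole sequence into a single Linear input of size seq_len * hidden_size.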