In language modelling with RNNs, the output (let's denote it by h) of your RNN is almost always a tensor with dimensions:
h.shape = [time, batch_size, hidden_size]
And from here, a common practice is to use a “decoding” linear layer:
decoder = nn.Linear(hidden_size, vocab_size)
to obtain logits with dimensions:
logits.shape = [time, batch_size, vocab_size]
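To make the shapes concrete, here is a minimal sketch of this setup (the sizes are hypothetical, chosen just for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just to make the shapes concrete
time, batch_size, hidden_size, vocab_size = 5, 3, 8, 10

h = torch.randn(time, batch_size, hidden_size)  # stand-in for the RNN output
decoder = nn.Linear(hidden_size, vocab_size)

print(h.shape)               # torch.Size([5, 3, 8])
print(decoder.weight.shape)  # torch.Size([10, 8])
```

Note that nn.Linear stores its weight as [vocab_size, hidden_size], i.e. [out_features, in_features].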
Now, I have seen people doing this in two ways. One:
logits = decoder(h)
which boils down to a matrix multiplication of the 3D tensor h with the 2D weight matrix inside the decoder.
Two:
logits = decoder(h.view(time * batch_size, hidden_size))
i.e. they first reshape the 3D tensor into a 2D one and then pass it to the decoder, which boils down to a matrix multiplication between two 2D tensors. Then, depending on what format we want the logits in, we can reshape back with:
logits = logits.view(time, batch_size, vocab_size)
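Here is how one could compare the two approaches numerically (again with hypothetical sizes; the print at the end shows whether the results match):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
time, batch_size, hidden_size, vocab_size = 5, 3, 8, 10  # hypothetical sizes

h = torch.randn(time, batch_size, hidden_size)
decoder = nn.Linear(hidden_size, vocab_size)

# Approach one: apply the layer to the 3D tensor directly
logits_3d = decoder(h)

# Approach two: flatten time and batch into one dimension,
# decode, then reshape back to 3D
logits_2d = decoder(h.view(time * batch_size, hidden_size))
logits_2d = logits_2d.view(time, batch_size, vocab_size)

print(logits_3d.shape, logits_2d.shape)
print(torch.allclose(logits_3d, logits_2d))
```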
So, finally, my questions:
- Are the two approaches identical?
- If not, why?
- If yes, is there a best practice on which one to use, and why?