[LANGUAGE MODELS] From hidden to logits, 3D and 2D tensor multiplication

In language modelling with RNNs, the output of your RNN (let's denote it by h) is almost always a tensor with dimensions:

h.shape = [time, batch_size, hidden_size]

And from here, a common practice is to use a “decoding” linear layer:

decoder = nn.Linear(hidden_size, vocab_size)

to obtain logits with dimensions:

logits.shape = [time, batch_size, vocab_size]

Now, I have seen people doing this in two ways. One:

logits = decoder(h)

where this boils down to a matrix multiplication of the 3D tensor h with the 2D tensor inside the decoder.

Two:

logits = decoder(h.view(time * batch_size, hidden_size))

i.e. they first reshape the 3D tensor into a 2D one and then pass it to the decoder, which boils down to a matrix multiplication between two 2D tensors. Then, depending on what format we want the logits in, we can reshape back with:

logits = logits.view(time, batch_size, vocab_size)

So, finally my questions:

  1. Are the two approaches identical?
  2. If not why?
  3. If yes, is there a best practice on which to use and why?
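For what it's worth, here is a small sketch (with made-up toy dimensions) that checks both approaches numerically. `nn.Linear` applies its weight matrix to the last dimension of the input, so the 3D and the flattened-2D paths should produce the same logits:

```python
import torch
import torch.nn as nn

# Toy dimensions, just for the comparison
time_steps, batch_size, hidden_size, vocab_size = 5, 3, 8, 10

torch.manual_seed(0)
decoder = nn.Linear(hidden_size, vocab_size)
h = torch.randn(time_steps, batch_size, hidden_size)

# Approach one: apply the linear layer to the 3D tensor directly
logits_3d = decoder(h)

# Approach two: flatten to 2D, decode, then reshape back
logits_2d = decoder(h.view(time_steps * batch_size, hidden_size))
logits_2d = logits_2d.view(time_steps, batch_size, vocab_size)

# Identical up to floating-point precision
print(torch.allclose(logits_3d, logits_2d))
```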

The main reason people collapse the batch and sequence-length axes into a single axis is the CrossEntropyLoss layer, which expects logits of shape [num_items, num_classes]. Other than that, the two approaches are pretty much the same.


That makes sense. Thanks a lot for your fast answer!