Output of a GRU layer

Hey guys,
I defined a GRU layer like this, with n_layers = 2:

self.gru = nn.GRU(embed_dim, hidden_dim, n_layers)

self.fc = nn.Linear(hidden_dim, output_dim)

In the forward function, what should I pass to the fc layer?

output, hidden = self.gru(x)

fc_input = output[-1]

or

fc_input = torch.cat((hidden[-1, :, :], hidden[-2, :, :]), dim = 1)

return torch.sigmoid(self.fc(fc_input))

What about having a bidirectional model? What would be the fc_input in that case?
Thank you.

For sequence classification tasks, the input to the fully-connected layer should be output[-1]. hidden is usually what gets passed to the decoder in seq2seq models.
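As a minimal sketch (the class and names below are hypothetical, assuming a unidirectional GRU with the default batch_first=False), that looks like:

import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, embed_dim, hidden_dim, output_dim, n_layers=2):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, n_layers)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (seq_len, batch, embed_dim)
        output, hidden = self.gru(x)
        # output: (seq_len, batch, hidden_dim); take the last time step
        fc_input = output[-1]                      # (batch, hidden_dim)
        return torch.sigmoid(self.fc(fc_input))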

In the case of a bidirectional model, the last dimension of the output doubles in size, so the output shape is (seq_len, batch, 2 * hidden_size).

Some options for combining the two directions (forward and backward) are (see the sketch after this list):

  • sum them
  • apply an affine transformation to them, i.e. pass them through nn.Linear(2 * hidden_size, next_layer_input_dim), which is trained together with the rest of the model
  • leave them untouched and pass them to the next layer (but then the input dim of the next layer has to be doubled to 2 * hidden_size)
  • etc.
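A minimal sketch of these options, assuming a bidirectional GRU output with batch_first=False (the tensor and the sizes below are made up just for illustration):

import torch
import torch.nn as nn

seq_len, batch, hidden_size, next_layer_input_dim = 7, 4, 8, 16  # hypothetical sizes
output = torch.randn(seq_len, batch, 2 * hidden_size)            # stands in for the BiGRU output

# split the last dimension into the forward and backward directions
fwd, bwd = output[..., :hidden_size], output[..., hidden_size:]

# 1) sum the two directions -> (seq_len, batch, hidden_size)
combined_sum = fwd + bwd

# 2) affine transformation of the concatenated directions,
#    trained together with the rest of the model
proj = nn.Linear(2 * hidden_size, next_layer_input_dim)
combined_proj = proj(output)                                     # (seq_len, batch, next_layer_input_dim)

# 3) leave them untouched; the next layer then needs input dim 2 * hidden_size
combined_cat = output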

In the case of a BiGRU, output[-1] gives you the last hidden state of the forward direction but the first hidden state of the backward direction; see here. If only the last hidden state is fed to a linear layer, it's therefore more convenient to use hidden and not output. For a BiGRU, I would suggest:

output, hidden = self.gru(x)
# hidden.shape = (n_layers * n_directions, batch_size, hidden_dim)
hidden = hidden.view(n_layers, n_directions, batch_size, hidden_dim)
# This view() comes directly from the PyTorch docs
# hidden.shape = (n_layers, n_directions, batch_size, hidden_dim)
hidden = hidden[-1]
# hidden.shape = (n_directions, batch_size, hidden_dim)
hidden_forward, hidden_backward = hidden[0], hidden[1]
# Both shapes (batch_size, hidden_dim)
fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
# fc_input.shape = (batch_size, 2*hidden_dim)

@moorccini This is quite verbose, but I like my code to be easy to read :). Note that the problem with your line

fc_input = torch.cat((hidden[-1, :, :], hidden[-2, :, :]), dim = 1)

is that you don't account for the multiple layers in hidden, so hidden[-1, :, :] and hidden[-2, :, :] will give you the wrong tensors, at least in the general case (it may happen to be correct with n_layers=1).
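A quick way to see this is the following minimal sketch (hypothetical sizes, unidirectional GRU as in your original definition):

import torch
import torch.nn as nn

n_layers, seq_len, batch_size, embed_dim, hidden_dim = 2, 7, 4, 5, 8
gru = nn.GRU(embed_dim, hidden_dim, n_layers)
x = torch.randn(seq_len, batch_size, embed_dim)

output, hidden = gru(x)
print(hidden.shape)                              # (n_layers, batch_size, hidden_dim)
# For a unidirectional GRU, output[-1] equals the last layer's final hidden state:
print(torch.allclose(output[-1], hidden[-1]))    # True
# hidden[-2] is the final state of the *lower* layer, not a second direction,
# so torch.cat((hidden[-1], hidden[-2]), dim=1) mixes layers here.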


Thank you very much, it helped me a lot

@vdw Thanks for reminding me. It doesn’t make much sense to combine the last hidden state of the forward pass with the first hidden state of the backward pass in sequence classification tasks.

@moorccini In the bidirectional case, output[-1, :, :hidden_size] gives you the last hidden state of the forward pass, and output[0, :, hidden_size:] gives you the last hidden state of the backward pass. You can then concatenate or transform these states as desired; see the sketch below. But I like @vdw's solution better, since it is more elegant. If you just pass output[-1] of the bidirectional model (which contains the first hidden state of the backward pass) to the next layer in a sequence classification task, the model's performance could even be worse than that of the unidirectional model. The output shape is geared towards sequence labelling, so it can be a bit tricky to get right in sequence classification tasks.
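A minimal sketch of that approach (hypothetical sizes, bidirectional GRU, batch_first=False):

import torch
import torch.nn as nn

seq_len, batch_size, embed_dim, hidden_size, n_layers = 7, 4, 5, 8, 2
gru = nn.GRU(embed_dim, hidden_size, n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, embed_dim)

output, hidden = gru(x)                          # output: (seq_len, batch, 2 * hidden_size)

# last hidden state of the forward pass: last time step, first half of the features
h_fwd = output[-1, :, :hidden_size]
# last hidden state of the backward pass: first time step, second half of the features
h_bwd = output[0, :, hidden_size:]

fc_input = torch.cat((h_fwd, h_bwd), dim=1)      # (batch, 2 * hidden_size)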
