In the case of a BiGRU, `output[-1]` gives you the last hidden state for the forward direction but only the first hidden state of the backward direction; see here. If only the last hidden state is fed to a linear layer, it's therefore more convenient to use `hidden` and not `output`. For a BiGRU, I would suggest:
```python
output, hidden = self.gru(x)
# hidden.shape = (n_layers * n_directions, batch_size, hidden_dim)

hidden = hidden.view(n_layers, n_directions, batch_size, hidden_dim)
# This view() comes directly from the PyTorch docs
# hidden.shape = (n_layers, n_directions, batch_size, hidden_dim)

hidden = hidden[-1]
# hidden.shape = (n_directions, batch_size, hidden_dim)

hidden_forward, hidden_backward = hidden[0], hidden[1]
# Both have shape (batch_size, hidden_dim)

fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
# fc_input.shape = (batch_size, 2 * hidden_dim)
```
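As a self-contained sanity check of the snippet above (the sizes below are made-up toy values, and I instantiate a standalone `nn.GRU` instead of `self.gru`), the reshaping and concatenation can be verified end to end:

```python
import torch
import torch.nn as nn

# Toy dimensions (arbitrary example values)
n_layers, n_directions = 2, 2
seq_len, batch_size = 7, 4
input_dim, hidden_dim = 5, 8

gru = nn.GRU(input_dim, hidden_dim, num_layers=n_layers, bidirectional=True)

# Input shape (seq_len, batch_size, input_dim), i.e. batch_first=False
x = torch.randn(seq_len, batch_size, input_dim)
output, hidden = gru(x)

# hidden.shape = (n_layers * n_directions, batch_size, hidden_dim)
hidden = hidden.view(n_layers, n_directions, batch_size, hidden_dim)
hidden = hidden[-1]  # keep only the last layer: (n_directions, batch_size, hidden_dim)
hidden_forward, hidden_backward = hidden[0], hidden[1]

fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
print(fc_input.shape)  # (batch_size, 2 * hidden_dim)
```

`fc_input` is then ready to be fed into a linear layer of input size `2 * hidden_dim`.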
@moorccini This is admittedly verbose, but I like to have my code easy to read :). Note that the problem with your line `fc_input = torch.cat((hidden[-1, :, :], hidden[-2, :, :]), dim=1)` is that you don't account for the multiple layers in `hidden`, so `hidden[-1, :, :]` and `hidden[-2, :, :]` will give you the wrong tensors, at least in the general case (it may be correct with `n_layers=1`).
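To make the point from the top of the thread concrete (again with made-up toy sizes): the forward half of `output[-1]` matches the forward direction's final hidden state, but the backward direction's final state actually sits in `output[0]`, which is why `output[-1]` alone mixes a final state with a first state:

```python
import torch
import torch.nn as nn

n_layers, n_directions = 1, 2
seq_len, batch_size, input_dim, hidden_dim = 6, 3, 5, 8

gru = nn.GRU(input_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_dim)
output, h_n = gru(x)  # output.shape = (seq_len, batch_size, 2 * hidden_dim)

h = h_n.view(n_layers, n_directions, batch_size, hidden_dim)[-1]
h_fwd, h_bwd = h[0], h[1]

# output[-1] holds the forward direction's LAST step...
assert torch.allclose(output[-1, :, :hidden_dim], h_fwd, atol=1e-6)
# ...but the backward direction's final state lives at output[0]
assert torch.allclose(output[0, :, hidden_dim:], h_bwd, atol=1e-6)
```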