Output of a GRU layer

In the case of a BiGRU, output[-1] gives you the last hidden state of the forward direction but the first hidden state of the backward direction; see here. If only the last hidden state is fed to a linear layer, it's therefore more convenient to use hidden rather than output (a small runnable check of this is sketched after the snippet below). For a BiGRU, I would suggest

output, hidden = self.gru(x)
# hidden.shape = (n_layers * n_directions, batch_size, hidden_dim)
hidden = hidden.view(n_layers, n_directions, batch_size, hidden_dim)
# This view() comes directly from the PyTorch docs
# hidden.shape = (n_layers, n_directions, batch_size, hidden_dim)
hidden = hidden[-1]
# hidden.shape = (n_directions, batch_size, hidden_dim)
hidden_forward, hidden_backward = hidden[0], hidden[1]
# Both shapes (batch_size, hidden_dim)
fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
# fc_input.shape = (batch_size, 2*hidden_dim)
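
To make the forward/backward claim above concrete, here is a minimal standalone sketch (the toy sizes and variable names are made up for illustration): it builds a small bidirectional nn.GRU, checks that the forward direction's final state sits at the last time step of output while the backward direction's final state sits at the first time step, and then builds fc_input exactly as above.

import torch
import torch.nn as nn

# Toy sizes, purely illustrative
seq_len, batch_size, input_dim = 7, 4, 5
n_layers, hidden_dim, n_directions = 2, 3, 2

gru = nn.GRU(input_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_dim)  # default layout: (seq_len, batch, input_dim)

output, hidden = gru(x)
# output.shape = (seq_len, batch_size, n_directions * hidden_dim)
# hidden.shape = (n_layers * n_directions, batch_size, hidden_dim)

hidden = hidden.view(n_layers, n_directions, batch_size, hidden_dim)[-1]
hidden_forward, hidden_backward = hidden[0], hidden[1]

# The forward direction's final state is at the LAST time step of output ...
print(torch.allclose(output[-1, :, :hidden_dim], hidden_forward))   # True
# ... while the backward direction's final state is at the FIRST time step.
print(torch.allclose(output[0, :, hidden_dim:], hidden_backward))   # True

fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
print(fc_input.shape)  # torch.Size([4, 6]), i.e. (batch_size, 2 * hidden_dim)

Concatenating in (forward, backward) order mirrors how output lays out the two directions, but any consistent order works for the downstream linear layer.
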

@moorccini This view()-based version is admittedly verbose; I like to keep my code easy to read :). Note that the problem with your line

fc_input = torch.cat((hidden[-1, :, :], hidden[-2, :, :]), dim = 1)

is that you don't account for the multiple layers in hidden, so hidden[-1, :, :] and hidden[-2, :, :] will give you the wrong tensors, at least in the general case (it may be correct with n_layers=1).
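
For the single-layer case mentioned in the parenthetical, here is a quick check (again with made-up toy sizes): with n_layers=1, hidden has shape (2, batch_size, hidden_dim), so hidden[-2] and hidden[-1] are the forward and backward states of that one layer and the one-liner produces the expected shape.

import torch
import torch.nn as nn

# Single-layer BiGRU: hidden.shape == (2, batch_size, hidden_dim),
# so hidden[-2] / hidden[-1] are the forward / backward states of that layer.
batch_size, input_dim, hidden_dim = 4, 5, 3
gru = nn.GRU(input_dim, hidden_dim, num_layers=1, bidirectional=True)
x = torch.randn(7, batch_size, input_dim)

_, hidden = gru(x)
fc_input = torch.cat((hidden[-1, :, :], hidden[-2, :, :]), dim=1)
print(fc_input.shape)  # torch.Size([4, 6]), i.e. (batch_size, 2 * hidden_dim)

Once num_layers > 1, the flat layout makes it easy to grab states from the wrong layer by accident, which is why the explicit view() in the snippet above reads more safely.
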
