In the case of a BiGRU, `output[-1]` gives you the last hidden state for the forward direction but only the first hidden state of the backward direction; see here. If only the last hidden state is fed to a linear layer, it's therefore more convenient to use `hidden` and not `output`. For a BiGRU, I would suggest:

```
output, hidden = self.gru(x)
# hidden.shape = (n_layers * n_directions, batch_size, hidden_dim)
hidden = hidden.view(n_layers, n_directions, batch_size, hidden_dim)
# This view() comes directly from the PyTorch docs
# hidden.shape = (n_layers, n_directions, batch_size, hidden_dim)
hidden = hidden[-1]
# hidden.shape = (n_directions, batch_size, hidden_dim)
hidden_forward, hidden_backward = hidden[0], hidden[1]
# Both shapes (batch_size, hidden_dim)
fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
# fc_input.shape = (batch_size, 2*hidden_dim)
```
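
To see why `output[-1]` mixes the two directions, here is a small sanity check (all sizes are made up for illustration; it only assumes the standard `nn.GRU` with `batch_first=False`):

```
import torch
import torch.nn as nn

seq_len, batch_size, input_dim, hidden_dim, n_layers = 5, 3, 4, 8, 2
gru = nn.GRU(input_dim, hidden_dim, num_layers=n_layers, bidirectional=True)

x = torch.randn(seq_len, batch_size, input_dim)
output, hidden = gru(x)
hidden = hidden.view(n_layers, 2, batch_size, hidden_dim)

# Forward half of output[-1] is the forward direction's final state ...
assert torch.allclose(output[-1, :, :hidden_dim], hidden[-1, 0])
# ... but the backward direction's final state sits at the *first* time step.
assert torch.allclose(output[0, :, hidden_dim:], hidden[-1, 1])
```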

@moorccini Yes, it's a bit verbose; I like to keep my code easy to read :). Note that the problem with your line

```
fc_input = torch.cat((hidden[-1, :, :], hidden[-2, :, :]), dim = 1)
```

is that you don't account for the multiple layers in `hidden`, so `hidden[-1, :, :]` and `hidden[-2, :, :]` will give you the wrong tensors, at least in the general case (maybe it's correct with `n_layers=1`).
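
For completeness, here is a minimal sketch of how the `view()`-based extraction could sit inside a classifier module (the class name, `n_classes`, and the final linear head are hypothetical, just to show where `fc_input` would be used):

```
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_layers, n_classes):
        super().__init__()
        self.n_layers, self.hidden_dim = n_layers, hidden_dim
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):
        # x.shape = (seq_len, batch_size, input_dim)
        _, hidden = self.gru(x)
        batch_size = x.shape[1]
        # Separate layers and directions, then keep the last layer only
        hidden = hidden.view(self.n_layers, 2, batch_size, self.hidden_dim)
        hidden_forward, hidden_backward = hidden[-1, 0], hidden[-1, 1]
        fc_input = torch.cat((hidden_forward, hidden_backward), dim=1)
        return self.fc(fc_input)
```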