From what I understand, when we run a GRU, e.g.
embed = nn.Embedding(vocab_size, embed_size)
gru = nn.GRU(embed_size, hidden_size, batch_first = True, bidirectional = True)
# src is a bsz x seqlen LongTensor of token indices
embedded_src = embed(src)
# embedded_src is a bsz x seqlen x embed_size tensor
out, hidden = gru(embedded_src)
# out is a bsz x seqlen x (hidden_size * n_directions) tensor (since batch_first=True)
# hidden is a (n_layers * n_directions) x bsz x hidden_size tensor (batch_first does NOT apply to hidden)
I’ve seen people concatenate the hidden state, but is it a misunderstanding to assume that out and hidden should match anywhere? E.g., for Seq2Seq models I’ve commonly seen this used to build the context vector:
# hidden is never batch_first, so this indexing is the same either way
# returns a tensor of bsz x (n_directions * hidden_size)
context = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim = 1)
My understanding from the documentation was that
# in the batch_first case, for the 0th example
# out[0][-1] == context[0]
It does seem to be the case for the forward part of the GRU output but not the reverse? Have I misunderstood something? Thanks!
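For reference, here is a minimal sketch of the check I’m running (random inputs, arbitrary sizes, single layer). My expectation is that the forward final hidden state matches the last time step of out, while the backward final hidden state matches the first time step, since the reverse direction processes the sequence from the end:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bsz, seqlen, embed_size, hidden_size = 2, 5, 8, 16

gru = nn.GRU(embed_size, hidden_size, batch_first=True, bidirectional=True)
x = torch.randn(bsz, seqlen, embed_size)  # stand-in for embedded_src

out, hidden = gru(x)
# out:    bsz x seqlen x (2 * hidden_size)
# hidden: 2 x bsz x hidden_size  (forward layer, then backward layer)

# Forward direction: final hidden state == LAST time step of the forward half of out
print(torch.allclose(hidden[-2], out[:, -1, :hidden_size]))  # True

# Backward direction: final hidden state == FIRST time step of the backward half of out
print(torch.allclose(hidden[-1], out[:, 0, hidden_size:]))   # True
```

So out[0][-1] only matches context in its first hidden_size entries; the backward half of context lives at out[0][0] instead.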