From what I understand, when we run a GRU, e.g.
embed = nn.Embedding(vocab_size, embed_size)
gru = nn.GRU(embed_size, hidden_size, batch_first = True, bidirectional = True)
# src is a bsz x seqlen LongTensor of token indices
embedded_src = embed(src)
# embedded_src is a bsz x seqlen x embed_size tensor
out, hidden = gru(embedded_src)
# out is a bsz x seqlen x (hidden_size * n_directions) tensor (since batch_first=True)
# hidden is a (n_layers * n_directions) x bsz x hidden_size tensor (batch_first does NOT apply to hidden)
I’ve seen people concatenate the hidden state, but is it a misunderstanding to assume that out and hidden should match anywhere? E.g., for Seq2Seq models I’ve commonly seen this used to build the context vector:
# hidden is never batch_first, so this indexing is the same either way
# returns a tensor of bsz x (n_directions * hidden_size)
context = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim = 1)
My understanding from the documentation was that
# in the batch_first case, for the 0th example
# out[0][-1] == context[0]
It does seem to be the case for the forward part of the GRU output but not the reverse? Have I misunderstood something? Thanks!
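For reference, here is a minimal sketch of the check I’m running (random inputs, arbitrary sizes, single layer). My expectation is that the forward final hidden state matches the last time step of out, while the backward final hidden state matches the first time step, since the reverse direction processes the sequence from the end:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bsz, seqlen, embed_size, hidden_size = 2, 5, 8, 16

gru = nn.GRU(embed_size, hidden_size, batch_first=True, bidirectional=True)
x = torch.randn(bsz, seqlen, embed_size)  # stand-in for embedded_src

out, hidden = gru(x)
# out:    bsz x seqlen x (2 * hidden_size)
# hidden: 2 x bsz x hidden_size  (forward layer, then backward layer)

# Forward direction: final hidden state == LAST time step of the forward half of out
print(torch.allclose(hidden[-2], out[:, -1, :hidden_size]))  # True

# Backward direction: final hidden state == FIRST time step of the backward half of out
print(torch.allclose(hidden[-1], out[:, 0, hidden_size:]))   # True
```

So out[0][-1] only matches context in its first hidden_size entries; the backward half of context lives at out[0][0] instead.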