From what I understand, when we run a GRU, e.g.
```python
embed = nn.Embedding(vocab_size, embed_size)
gru = nn.GRU(embed_size, hidden_size, batch_first=True, bidirectional=True)

# src is a bsz x seqlen tensor of token indices
embedded_src = embed(src)
# embedded_src is a bsz x seqlen x embed_size tensor
out, hidden = gru(embedded_src)
# out is a bsz x seqlen x (hidden_size * n_directions) tensor (since batch_first=True)
# hidden is a (n_layers * n_directions) x bsz x hidden_size tensor
```
I’ve seen people concatenate the final hidden states, but is it a misunderstanding to assume that hidden and out should be identical anywhere? E.g., for Seq2Seq models I’ve commonly seen this done to build the context vector:
```python
# this indexing shouldn't depend on batch_first, since hidden is always
# (n_layers * n_directions) x bsz x hidden_size?
# returns a bsz x (n_directions * hidden_size) tensor
context = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
```
My understanding from the documentation was that:

```python
# in the batch_first case, for the 0th example:
# out[0, -1] == context[0]
```
This does seem to hold for the forward half of the GRU output, but not for the reverse half. Have I misunderstood something?
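For reference, here's a minimal self-contained sketch of the check I'm running (single layer, bidirectional, with toy sizes I made up). The last line is my guess that the reverse direction's final state lines up with the first time step rather than the last:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy sizes, purely for illustration
vocab_size, embed_size, hidden_size = 100, 8, 16
bsz, seqlen = 4, 7

embed = nn.Embedding(vocab_size, embed_size)
gru = nn.GRU(embed_size, hidden_size, batch_first=True, bidirectional=True)

src = torch.randint(0, vocab_size, (bsz, seqlen))  # bsz x seqlen token indices
out, hidden = gru(embed(src))
context = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)  # bsz x (2 * hidden_size)

# forward half of out at the LAST time step vs. hidden[-2]:
print(torch.allclose(out[:, -1, :hidden_size], hidden[-2]))  # True
# reverse half of out at the LAST time step vs. hidden[-1]:
print(torch.allclose(out[:, -1, hidden_size:], hidden[-1]))  # False
# reverse half of out at the FIRST time step vs. hidden[-1]:
print(torch.allclose(out[:, 0, hidden_size:], hidden[-1]))   # True?
```

Thanks!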