I don’t use PyTorch as often as I should, so I always need to consult the documentation. Recently I came across an issue that was well documented in previous versions but, as far as I can tell, no longer in the current one.
The question is about handling the last hidden state `h_n` of an `nn.LSTM` layer (the same applies to `nn.GRU`). The issue is that one of its dimensions is the product of `num_layers` and `num_directions`. The documentation for PyTorch version 1.0.0 is pretty clear:
> `h_n` of shape `(num_layers * num_directions, batch, hidden_size)`: tensor containing the hidden state for `t = seq_len`. Like `output`, the layers can be separated using `h_n.view(num_layers, num_directions, batch, hidden_size)` and similarly for `c_n`.
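As a concrete sketch of that 1.0.0 recipe (a toy bidirectional two-layer LSTM; all sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

num_layers, num_directions = 2, 2  # bidirectional=True gives 2 directions
batch, seq_len, input_size, hidden_size = 4, 5, 3, 8

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

# The first dimension of h_n fuses layers and directions:
print(h_n.shape)  # torch.Size([4, 4, 8]) -> (num_layers * num_directions, batch, hidden_size)

# The separation described in the 1.0.0 docs:
h = h_n.view(num_layers, num_directions, batch, hidden_size)
print(h.shape)    # torch.Size([2, 2, 4, 8])
```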
It gives a concrete example of how to separate the `num_layers` and `num_directions` dimensions, and this is what I have always used in my implementations. However, the documentation for PyTorch version 1.13 reads as follows:
> `h_n`: tensor of shape `(D * num_layers, H_out)` for unbatched input or `(D * num_layers, N, H_out)` containing the final hidden state for each element in the sequence. When `bidirectional=True`, `h_n` will contain a concatenation of the final forward and reverse hidden states, respectively.
While mapping the names (`D` = `num_directions`, `N` = `batch`, `H_out` = `hidden_size`) is straightforward, the documentation no longer shows how to split `D` and `num_layers`. It’s tempting to adopt the old method, `h_n.view(num_layers, D, N, H_out)`,
but note that the order of `D` and `num_layers` is now flipped: `(D * num_layers, …)` vs. `(num_layers * num_directions, …)`. Does this mean I now have to do `h_n.view(D, num_layers, N, H_out)` instead? I’m pretty sure the order matters, but I can’t determine for certain which version is the correct one. Or what am I missing here?
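For what it’s worth, the question can be settled empirically without trusting either documentation version: for the last layer, the forward direction’s final hidden state must equal the last time step of `output`, so only one of the two views can group the rows correctly. A minimal sketch with toy sizes (all names below are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_layers, D = 3, 2                      # D = num_directions
N, T, input_size, H_out = 4, 7, 5, 6      # toy sizes

lstm = nn.LSTM(input_size, H_out, num_layers=num_layers, bidirectional=True)
x = torch.randn(T, N, input_size)
output, (h_n, c_n) = lstm(x)              # h_n: (D * num_layers, N, H_out)

# The two candidate views reinterpret the same memory differently,
# so at most one of them groups the rows correctly.
a = h_n.view(num_layers, D, N, H_out)     # old documented order
b = h_n.view(D, num_layers, N, H_out)     # order suggested by "D * num_layers"

# Ground truth from `output`: for the last layer, the forward direction's
# final state is the last time step (first H_out channels) and the reverse
# direction's final state is the first time step (last H_out channels).
fwd_last = output[-1, :, :H_out]
rev_last = output[0, :, H_out:]

print(torch.allclose(a[-1, 0], fwd_last))  # old grouping, forward
print(torch.allclose(a[-1, 1], rev_last))  # old grouping, reverse
print(torch.allclose(b[0, -1], fwd_last))  # flipped grouping, forward
```

Whichever grouping reproduces `fwd_last` and `rev_last` is the correct one for the PyTorch version at hand; rerunning this check against your own install sidesteps the documentation question entirely.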