It looks like we often provide our own embedding before the LSTM/GRU, and then set input_size == hidden_size for the recurrent layer, e.g. in http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html :

self.embedding = nn.Embedding(input_size, hidden_size)
self.gru = nn.GRU(hidden_size, hidden_size)
This seems kind of 'wasteful', since it adds an extra hidden_size x hidden_size matrix multiply at the input of the GRU, which we don't actually need, right?
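For what it's worth, nothing seems to require the two sizes to match: nn.GRU accepts any input_size, so the embedding can be narrower than the hidden state. A minimal sketch (the sizes below are made up purely for illustration):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_size = 1000, 64, 256  # illustrative sizes

# The embedding dimension is decoupled from hidden_size here;
# the GRU's input-to-hidden weight is then hidden_size x embedding_dim.
embedding = nn.Embedding(vocab_size, embedding_dim)
gru = nn.GRU(embedding_dim, hidden_size)

tokens = torch.randint(0, vocab_size, (10, 1))  # (seq_len, batch)
out, h = gru(embedding(tokens))
print(out.shape)  # (10, 1, 256): seq_len, batch, hidden_size
```

So the tutorial's choice of input_size == hidden_size looks like a simplification rather than a requirement.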