I have a question about reusing the pretrained output linear projection layer of one LSTM network in another.
What I have:
- A pretrained fully connected layer that was used to project the LSTM output onto word probabilities over the vocabulary in a language generation model, something like this:
self.output_linear_projection = nn.Linear(self.wordRNN_dim, self.vocab_size)
self.wordRNN_dim is 512 (the hidden size of the LSTM), and self.vocab_size is the number of words I have. For this pretrained model, the vocabulary size is 10509: it has 10508 words, and the last element is the projection of the <start> token (the same projection is used for these tokens).
- My language generation model, for which I want to use the output linear projection layer from the pretrained model. The vocabulary of my model also includes a <pad> token, which is absent from the pretrained model, so my vocabulary size is 10510.
- If I want to use this pretrained output linear projection layer in my model, what should or could I do? Is it conventional not to project <pad> tokens? If so, should I ignore <pad> in my projection, and how? And if there are steps I should take, any hints on how to do them in PyTorch?
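
To make the setup concrete, here is a minimal sketch of one possible approach, not a confirmed answer. It assumes that the first 10509 indices of my vocabulary match the pretrained vocabulary exactly and that <pad> is appended as the last index. The names `pretrained`, `new_proj`, and `pad_idx` are hypothetical, and `pretrained` is a randomly initialised stand-in for the layer that would really be loaded from a checkpoint. The idea is to copy the pretrained rows into a larger layer and to exclude <pad> targets from the loss with `ignore_index`.

```python
import torch
import torch.nn as nn

hidden_dim = 512            # self.wordRNN_dim in the question
old_vocab_size = 10509      # pretrained: 10508 words + <start>
new_vocab_size = 10510      # my model: same entries + <pad>
pad_idx = new_vocab_size - 1  # assumption: <pad> is the last index

# Stand-in for the pretrained layer; in practice its weights
# would come from the pretrained model's checkpoint.
pretrained = nn.Linear(hidden_dim, old_vocab_size)

# New, larger projection layer for my model.
new_proj = nn.Linear(hidden_dim, new_vocab_size)

# Copy the pretrained rows; the extra <pad> row keeps its random init.
with torch.no_grad():
    new_proj.weight[:old_vocab_size].copy_(pretrained.weight)
    new_proj.bias[:old_vocab_size].copy_(pretrained.bias)

# Targets equal to pad_idx contribute nothing to the loss or gradients.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# Tiny usage example with random hidden states.
hidden = torch.randn(4, hidden_dim)
targets = torch.tensor([1, 2, pad_idx, 3])
logits = new_proj(hidden)   # shape: (4, new_vocab_size)
loss = criterion(logits, targets)
```

With this arrangement the <pad> row is still projected, but padded positions never act as training targets. Whether the pretrained indices really line up with the new vocabulary has to be checked separately, since a mismatch would silently assign weights to the wrong words.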