Is there a recommended way to apply the same linear transformation to each of the outputs of an nn.LSTM layer? Suppose I have a decoder language model, and want a hidden size of X but I have a vocab size of Y.
With e.g. Torch’s rnn library I might do something like:
local dec = nn.Sequential()
dec:add(nn.LookupTable(opt.vocabSize, opt.hiddenSize))
dec:add(nn.Sequencer(nn.LSTM(opt.hiddenSize, opt.hiddenSize)))
dec:add(nn.Sequencer(nn.Linear(opt.hiddenSize, opt.vocabSize)))
dec:add(nn.Sequencer(nn.LogSoftMax()))
Now doing the LSTM and the softmax is easy in PyTorch - but what is the best way to add in the nn.Linear, or even several layers, e.g. nn.Linear(F.relu(nn.Linear(x)))? Do I just loop over the outputs in forward, or is there a more elegant way?
class Net(nn.Module):
    def __init__(self, vocabSz, hiddenSz):
        super(Net, self).__init__()
        self.emb = nn.Embedding(vocabSz, hiddenSz)
        self.dec = nn.LSTM(hiddenSz, hiddenSz)
        self.lin = nn.Linear(hiddenSz, vocabSz)

    def forward(self, input, hidden):
        embeddings = self.emb(input)
        out, hidden = self.dec(embeddings, hidden)
        # do the linear here for each lstm output.. what's the best way?
        return F.softmax(out, dim=-1), hidden
Remember to also return the hidden state, otherwise you won’t be able to forward the state at the following step.
And if you’re likely to train it with a cross-entropy criterion, just output the logit-state tuple (out, hidden).
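A minimal sketch of that approach (flattening the time and batch dimensions with view before the linear layer, and returning raw logits for nn.CrossEntropyLoss; vocabSz/hiddenSz are placeholder names):

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self, vocab_sz, hidden_sz):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, hidden_sz)
        self.dec = nn.LSTM(hidden_sz, hidden_sz)
        self.lin = nn.Linear(hidden_sz, vocab_sz)

    def forward(self, input, hidden):
        embeddings = self.emb(input)                 # (seq_len, batch, hidden_sz)
        out, hidden = self.dec(embeddings, hidden)   # out: (seq_len, batch, hidden_sz)
        logits = self.lin(out.view(-1, out.size(2))) # (seq_len * batch, vocab_sz)
        return logits, hidden
```

The targets would then be flattened to shape (seq_len * batch,) before being passed to the criterion.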
@apaszke if out has the shape (seq_len, batch, hidden_size) (as in the output of a unidirectional GRU layer), does out.view(-1, out.size(2)) keep the exact same values of the third dimension (i.e. hidden size) across the first two dimensions?
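This can be checked directly: view only merges the first two dimensions, so each hidden vector stays intact, with row t * batch + b of the flattened tensor equal to out[t, b]. A small self-contained check:

```python
import torch

seq_len, batch, hidden_size = 4, 2, 3
out = torch.randn(seq_len, batch, hidden_size)
flat = out.view(-1, out.size(2))  # (seq_len * batch, hidden_size)

# the hidden vectors are unchanged, only the leading dims are merged
for t in range(seq_len):
    for b in range(batch):
        assert torch.equal(flat[t * batch + b], out[t, b])
```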
Hi @Atcold,
Why do you think the hidden state needs to be forwarded ?
Correct me if I’m wrong, but isn’t the whole point of using nn.LSTM over nn.LSTMCell that you don’t have to iterate over the hidden state for each time step of the sequence (making the hidden state kind of irrelevant outside the LSTM)?
cc: @apaszke
If your inputs are independent (like for generating baby names, where each name is a separate thing) then yes, there is no need to do this. However, when the inputs are related, you’ll need to forward the states as well. For example, if you want to produce Shakespearean text, or something to that effect, you’ll benefit from doing so.
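Concretely, when consecutive chunks belong to one long sequence, the state returned by the LSTM is handed back in on the next call instead of being discarded. A sketch with made-up sizes (the detach line is what truncated backpropagation through time would add):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=8)

# one long sequence of 10 steps, cut into two chunks of 5 steps each
chunks = torch.randn(10, 3, 8).split(5)

hidden = None  # passing None makes the LSTM start from a zero state
for chunk in chunks:
    out, hidden = lstm(chunk, hidden)  # carry the state into the next chunk
    # hidden = tuple(h.detach() for h in hidden)  # for truncated BPTT
```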
@live-wire and @Shisho_Sama
Can you guys please follow up on this? I am new to pytorch and I am confused. So to understand what is happening I wrote this small toy code.
I did not have to take the view of the output before applying the Linear layer. Is this because I am using a more recent version of PyTorch (1.4) than when the discussion took place, or am I missing something here? I especially didn’t understand @Shisho_Sama’s comment about inputs being independent.
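On recent PyTorch versions nn.Linear accepts inputs of shape (*, in_features) and applies the transformation to the last dimension, so the explicit view is optional; both paths give the same result:

```python
import torch
import torch.nn as nn

lin = nn.Linear(8, 10)
out = torch.randn(5, 3, 8)  # (seq_len, batch, hidden)

direct = lin(out)  # (5, 3, 10): Linear broadcasts over the leading dims
flattened = lin(out.view(-1, 8)).view(5, 3, 10)
assert torch.allclose(direct, flattened)
```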