Linear layers on top of LSTM

Is there a recommended way to apply the same linear transformation to each of the outputs of an nn.LSTM layer? Suppose I have a decoder language model, and want a hidden size of X but I have a vocab size of Y.

With e.g. Torch’s rnn library I might do something like:

local dec = nn.Sequential()
dec:add(nn.LookupTable(opt.vocabSize, opt.hiddenSize))
dec:add(nn.Sequencer(nn.LSTM(opt.hiddenSize, opt.hiddenSize)))
dec:add(nn.Sequencer(nn.Linear(opt.hiddenSize, opt.vocabSize)))
dec:add(nn.Sequencer(nn.LogSoftMax()))

Now doing the LSTM and the softmax is easy in PyTorch - but what is the best way to add in the nn.Linear, or even several layers, e.g. nn.Linear(F.relu(nn.Linear(x)))? Do I just loop over the outputs in forward, or is there a more elegant way?

class Net(nn.Module):
    def __init__(self, vocabSz, hiddenSz):
        super(Net, self).__init__()
        self.emb = nn.Embedding(vocabSz, hiddenSz)
        self.dec = nn.LSTM(hiddenSz, hiddenSz)
        self.lin = nn.Linear(hiddenSz, vocabSz)
    def forward(self, input, hidden):
        embeddings = self.emb(input)
        out, hidden = self.dec(embeddings, hidden)
        # do the linear here for each lstm output.. what's the best way?
        return F.softmax(out), hidden

I’d do this:

out, hidden = self.dec(embeddings, hidden)
out = self.lin(out.view(-1, out.size(2)))
return F.softmax(out)

I think you need

return F.softmax(out), hidden

otherwise you won’t be able to forward the state the following step.
And, if it’s likely you’re going to train it with a cross-entropy criterion, then skip the softmax and output just the logits-and-state tuple (out, hidden), since nn.CrossEntropyLoss applies log-softmax internally.
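
For instance, a minimal sketch of what that looks like at training time (the shapes and the names logits, targets, criterion are illustrative, not from the original post):

import torch
import torch.nn as nn

# Illustrative shapes only
seq_len, batch, vocab_size = 5, 3, 10

# Pretend these are the flattened logits returned by the decoder's forward
logits = torch.randn(seq_len * batch, vocab_size)
targets = torch.randint(0, vocab_size, (seq_len, batch))

# nn.CrossEntropyLoss = LogSoftmax + NLLLoss, so no softmax inside the model
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets.view(-1))  # targets flattened to (seq_len * batch,)
print(loss.item())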

edited example code – typo when transcribing, thanks.

You’ll need to view the output before sending it to the linear layer (the batch dimension becomes seq_len * batch for the linear layer), like in the word language model example. https://github.com/pytorch/examples/blob/master/word_language_model/model.py
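
For concreteness, a minimal sketch along those lines (it assumes a unidirectional, single-layer LSTM with seq-first input; reshaping back to (seq_len, batch, vocabSz) and returning raw logits rather than softmax output are my choices, so it pairs with CrossEntropyLoss):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, vocabSz, hiddenSz):
        super(Net, self).__init__()
        self.emb = nn.Embedding(vocabSz, hiddenSz)
        self.dec = nn.LSTM(hiddenSz, hiddenSz)
        self.lin = nn.Linear(hiddenSz, vocabSz)

    def forward(self, input, hidden):
        embeddings = self.emb(input)                # (seq_len, batch, hiddenSz)
        out, hidden = self.dec(embeddings, hidden)  # out: (seq_len, batch, hiddenSz)
        seq_len, batch, _ = out.size()
        out = self.lin(out.view(-1, out.size(2)))   # fold time and batch: (seq_len*batch, vocabSz)
        # return raw logits (for CrossEntropyLoss) together with the hidden state
        return out.view(seq_len, batch, -1), hidden

# usage sketch with made-up sizes
net = Net(vocabSz=100, hiddenSz=32)
inp = torch.randint(0, 100, (7, 4))                  # (seq_len, batch) of word indices
h0 = (torch.zeros(1, 4, 32), torch.zeros(1, 4, 32))  # (num_layers, batch, hiddenSz)
logits, hidden = net(inp, h0)
print(logits.shape)                                  # torch.Size([7, 4, 100])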

@apaszke if out has the shape (seq_len, batch, hidden_size) (as in the output of a unidirectional GRU layer), does out.view(-1, out.size(2)) keep the exact same values along the third dimension (i.e. hidden size) for each position in the first two dimensions?

Yes, it just basically folds together the first two dimensions, the third will stay unaffected.
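
A quick throwaway check of this (not from the original reply): row i * batch + j of the folded tensor is exactly out[i, j].

import torch

seq_len, batch, hidden_size = 4, 3, 128
out = torch.randn(seq_len, batch, hidden_size)
folded = out.view(-1, hidden_size)                    # (seq_len * batch, hidden_size)

print(torch.equal(folded[2 * batch + 1], out[2, 1]))  # True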

Example output from my code:

out
Variable containing:
( 0 ,.,.) = 
  0.0426 -0.2047 -0.0125  ...  -0.0409  0.1487  0.0376
 -0.0744 -0.0927  0.0185  ...  -0.1620  0.1348 -0.2179
  0.0111 -0.1565 -0.0192  ...  -0.1247  0.0352 -0.0625

( 1 ,.,.) = 
  0.0726 -0.1911  0.3049  ...  -0.1031  0.1991  0.0659
 -0.2233  0.1622  0.0794  ...  -0.1720  0.1020 -0.0430
  0.0600 -0.3536  0.0201  ...  -0.0990 -0.1203 -0.1207

( 2 ,.,.) = 
  0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
 -0.1225  0.1387  0.1205  ...  -0.1052 -0.1638 -0.2314
  0.0336 -0.3169  0.0090  ...  -0.0343  0.0635 -0.0280

( 3 ,.,.) = 
  0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
  0.0001 -0.1604  0.1111  ...  -0.0778 -0.1291 -0.0852
  0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
[torch.FloatTensor of size 4x3x128]

out.view(-1, out.size(2))
Variable containing:
 0.0426 -0.2047 -0.0125  ...  -0.0409  0.1487  0.0376
-0.0744 -0.0927  0.0185  ...  -0.1620  0.1348 -0.2179
 0.0111 -0.1565 -0.0192  ...  -0.1247  0.0352 -0.0625
          ...             ⋱             ...          
 0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
 0.0001 -0.1604  0.1111  ...  -0.0778 -0.1291 -0.0852
 0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
[torch.FloatTensor of size 12x128]

Hi @Atcold,
Why do you think the hidden state needs to be forwarded?
Correct me if I’m wrong, but isn’t the whole point of using nn.LSTM over nn.LSTMCell that you don’t have to iterate over the hidden state yourself at each time step of the sequence? (Making the hidden state kind of irrelevant outside the LSTM?) :confused:
cc: @apaszke


If your inputs are independent (like generating baby names, where each name is a separate thing) then yes, there is no need to do this. However, when the inputs are related, you’ll need to forward the states as well: for example, if you want to produce Shakespearean text, or something along those lines, you’ll benefit from doing so.
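
As a sketch of what forwarding the state between consecutive chunks can look like (the detach step is my addition; it is the usual way to keep the state values while cutting the backprop graph between chunks):

import torch

# Illustrative sizes; lstm stands in for the LSTM inside the model
lstm = torch.nn.LSTM(input_size=16, hidden_size=32)

hidden = None  # nn.LSTM initializes a zero state when hidden is None
for chunk in torch.randn(10, 5, 3, 16):  # 10 consecutive chunks of shape (seq_len=5, batch=3, 16)
    out, hidden = lstm(chunk, hidden)
    # keep the state values for the next chunk, but detach so backprop
    # does not reach all the way back through every previous chunk
    hidden = tuple(h.detach() for h in hidden)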

@live-wire and @Shisho_Sama
Can you please follow up on this? I am new to PyTorch and I am confused. To understand what is happening, I wrote this small toy example.

voc_size = 100
n_labels = 3
emb_dim = 16
rnn_size = 32
embedding = nn.Embedding(voc_size, emb_dim)
rnn = nn.LSTM(input_size=emb_dim, hidden_size=rnn_size, bidirectional=True, num_layers=1)
top_layer = nn.Linear(2 * rnn_size, n_labels)

sentences = torch.randint(high=voc_size, size=(10, 4))
print(sentences.shape)

embedded = embedding(sentences)
print(embedded.shape)

rnn_out, _ = rnn(embedded)
print(rnn_out.shape)

out = top_layer(rnn_out)
print(out.shape)

The output is as follows:

torch.Size([10, 4])
torch.Size([10, 4, 16])
torch.Size([10, 4, 64])
torch.Size([10, 4, 3])

I did not have to take the view of the output before applying the Linear layer. Is this because I am using a more recent version of PyTorch (1.4) than when the discussion took place? Or am I missing something here? I especially didn’t understand @Shisho_Sama’s comment about inputs being independent.
