Convert LSTM from Keras to PyTorch


I’m trying to convert an LSTM model from Keras to PyTorch.

# input size
X_train.shape = (10000, 48)
# output size
y_train.shape = (10000, 16)

Here is the original keras model:

model = Sequential()
model.add(Embedding(16, 10, input_length=48))
model.add(Dense(16, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

And this is my PyTorch model:

import torch.nn as nn
import torch.optim as optim

class LSTM(nn.Module):

    def __init__(self, embedding_size = 10, hidden_size = 50, vocab_size = 16, tagset_size = 16, dropout_rate = 0.1):
        super().__init__() # required before assigning submodules to an nn.Module
        self.embedding_size = embedding_size # 10
        self.hidden_size = hidden_size # 50
        self.vocab_size = vocab_size # 16
        self.tagset_size = tagset_size # 16
        self.dropout_rate = dropout_rate # 0.1

        self.embedding = nn.Embedding(vocab_size, embedding_size) # (16, 10)
        self.lstm = nn.LSTM(embedding_size, hidden_size) # (10, 50)
        self.dropout = nn.Dropout(dropout_rate) # 0.1
        self.hidden2tag = nn.Linear(hidden_size, tagset_size) # (50, 16)

    def forward(self, x):
        embed = self.embedding(x)
        lstm_out, lstm_hidden = self.lstm(embed, None)
        lstm_out = lstm_out[:,-1,:]
        drop_out = self.dropout(lstm_out)
        output = self.hidden2tag(drop_out)
        return output

model = LSTM(10, 50, 16, 16, 0.1) # class name must match the definition above
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters())

I’m not sure whether I’m doing this right. Any suggestions would be helpful.

There are a few things where Keras has defaults that are different from PyTorch’s that you will want to be aware of. If you search the forum for “Keras LSTM”, you will find a number of threads on the subject, including the one below that might be a good starting point.
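As one illustration of such a default: to my knowledge, Keras initializes the LSTM input weights with Glorot uniform, the recurrent weights with an orthogonal matrix, and the forget-gate bias with 1, whereas nn.LSTM draws all parameters from a uniform distribution scaled by the hidden size. A rough sketch of mimicking the Keras scheme in PyTorch could look like this. The helper name keras_style_init_ is made up for this example and is not part of either library.

```python
import torch
import torch.nn as nn

def keras_style_init_(lstm: nn.LSTM) -> None:
    """Hypothetical helper: re-initialize an nn.LSTM in place, Keras-style."""
    hidden = lstm.hidden_size
    for name, param in lstm.named_parameters():
        if name.startswith("weight_ih"):
            # Input-to-hidden weights: Glorot (Xavier) uniform
            nn.init.xavier_uniform_(param)
        elif name.startswith("weight_hh"):
            # Hidden-to-hidden weights: orthogonal
            nn.init.orthogonal_(param)
        elif name.startswith("bias"):
            with torch.no_grad():
                param.zero_()
                # PyTorch keeps two bias vectors (bias_ih and bias_hh) that are
                # summed, so the forget-gate bias of 1 is set on only one of them.
                # PyTorch's gate order is input, forget, cell, output.
                if name.startswith("bias_ih"):
                    param[hidden:2 * hidden].fill_(1.0)

lstm = nn.LSTM(10, 50, batch_first=True)
keras_style_init_(lstm)
```

Whether this matters for your results is something you would have to check empirically. It only removes one source of difference between the two frameworks.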

Best regards


Thank you for your reply. I’m not sure whether my input size in each layer is correct or not. Suppose the size of the input data is (256,48). After passing through an embedding layer with 10 units, it will become (256,48,10). Now followed by an LSTM layer with 50 units, the size will be (256,48,50). However, the expected output shape should be (256,50), so I use lstm_out = lstm_out[:,-1,:] to adjust the size. Is this a reasonable adjustment? Thank you.

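So one thing you need to do to get it to work is to pass batch_first=True when instantiating the LSTM, if that is what you want. By default, nn.LSTM expects its input as (seq_len, batch, features), so your (256, 48, 10) tensor would be read as 256 timesteps of 48 samples each.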
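To make this concrete, here is a minimal sketch using the sizes from your post, with random data standing in for the real embeddings. Note that the default layout does not raise an error: the slice still has shape (256, 50), it just selects the wrong axis.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, emb, hidden = 256, 48, 10, 50
embed = torch.randn(batch, seq_len, emb)  # stand-in for the embedding output

# batch_first=True: input is read as (batch, seq_len, features)
lstm_bf = nn.LSTM(emb, hidden, batch_first=True)
out_bf, (h_n, c_n) = lstm_bf(embed)
last_step = out_bf[:, -1, :]  # last timestep of every sample
print(out_bf.shape, last_step.shape)

# For a single-layer, unidirectional LSTM this equals the final hidden state
print(torch.allclose(last_step, h_n[-1]))

# Default (batch_first=False): the same tensor is read as
# (seq_len=256, batch=48, features), so [:, -1, :] picks the last *sample*
# at every "timestep" instead of the last timestep of every sample.
lstm_default = nn.LSTM(emb, hidden)
out_default, _ = lstm_default(embed)
wrong_slice = out_default[:, -1, :]
print(out_default.shape, wrong_slice.shape)
```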

While taking the last timestep (as you do with lstm_out[:, -1, :]) is certainly a common way to set up sequence-to-one problems (assuming your inputs are all of the same length), I would not call it a “size adjustment”. It says that the LSTM should “memorize” the relevant information until the end of the sequence. Other ways are possible: in ULMFiT (Howard and Ruder), they recommend concatenating the last timestep with the average and max over time (but then the linear layer would have 3 * hidden_size inputs, of course).
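A minimal sketch of that concatenation, reusing the layer sizes from your model. The class name ConcatPoolLSTM is just for illustration, and this is not the ULMFiT code itself, only the pooling idea applied to your setup:

```python
import torch
import torch.nn as nn

class ConcatPoolLSTM(nn.Module):
    """Illustrative model: last timestep + mean + max over time."""

    def __init__(self, vocab_size=16, embedding_size=10, hidden_size=50,
                 tagset_size=16, dropout_rate=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        # Three pooled vectors are concatenated, hence 3 * hidden_size inputs
        self.hidden2tag = nn.Linear(3 * hidden_size, tagset_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))  # (batch, seq_len, hidden)
        last = out[:, -1, :]                   # last timestep
        avg = out.mean(dim=1)                  # average over time
        mx = out.max(dim=1).values             # max over time
        pooled = torch.cat([last, avg, mx], dim=1)  # (batch, 3 * hidden)
        return self.hidden2tag(self.dropout(pooled))  # raw logits

model = ConcatPoolLSTM()
x = torch.randint(0, 16, (256, 48))  # random token ids in place of X_train
logits = model(x)
print(logits.shape)
```

The output is left as raw logits so it can be passed to nn.BCEWithLogitsLoss, as in your code.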

Best regards

