Multilayer LSTM & OpenNMT-py's `StackedLSTM`


I am wondering why OpenNMT-py defines a `StackedLSTM` class given that torch's native LSTM already has a `num_layers` param.

What justifies this choice?
Is it in order to apply dropout between layers?

`torch.nn.LSTM` doesn't support an attention mechanism, so attention has to be implemented manually on top of `torch.nn.LSTMCell`. However, `torch.nn.LSTMCell` doesn't have a `num_layers` param. Thus, OpenNMT-py defines a `StackedLSTM`, which supports attention and multiple layers, as an extension of `torch.nn.LSTMCell`.
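To make the "stacking `LSTMCell`s manually" point concrete, here is a minimal sketch of what such a stacked cell could look like. This is an illustrative reimplementation of the idea, not OpenNMT-py's actual code: it stacks `nn.LSTMCell`s, applies dropout between layers (but not after the last one), and exposes a single-timestep interface, which is what makes per-step attention and input feeding possible.

```python
import torch
import torch.nn as nn

class StackedLSTMSketch(nn.Module):
    """Illustrative stack of LSTMCells with inter-layer dropout.

    nn.LSTMCell has no num_layers param, so the stacking is done by
    hand; each call processes ONE timestep, so the caller is free to
    inject attention context between steps.
    """
    def __init__(self, num_layers, input_size, hidden_size, dropout=0.3):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.layers = nn.ModuleList()
        for _ in range(num_layers):
            self.layers.append(nn.LSTMCell(input_size, hidden_size))
            input_size = hidden_size  # next layer consumes this layer's output

    def forward(self, x, hidden):
        # hidden: (h, c), each of shape (num_layers, batch, hidden_size)
        h_prev, c_prev = hidden
        h_next, c_next = [], []
        for i, cell in enumerate(self.layers):
            h, c = cell(x, (h_prev[i], c_prev[i]))
            # dropout between layers, but not on the final output
            x = h if i == len(self.layers) - 1 else self.dropout(h)
            h_next.append(h)
            c_next.append(c)
        return x, (torch.stack(h_next), torch.stack(c_next))
```

A decoder would call this once per target token, weaving attention in between calls; `nn.LSTM` with `num_layers` would instead consume the whole sequence at once, leaving no hook for that.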

Hey, thanks for your reply. The post was flagged by mistake.

The thing is, the attention mechanism isn't part of the LSTM anyway (both conceptually and in the code). In ONMT-py, the decoder LSTM feeds its output into the attention layer at each timestep.

Therefore, I don’t think that the difference comes from attention.

Oh, my bad. I meant that the context vector, the output of the attention mechanism, is used as an additional input to the LSTM (Bahdanau et al.). But OpenNMT uses a different attention strategy, from Luong et al., and I think the goal is to support input feeding (feeding the context vector at each time step as additional input).
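The input-feeding loop can be sketched in a few lines. This is a hedged illustration, not OpenNMT-py's decoder: the names (`decode_step`, `emb_dim`, the `tanh` placeholder standing in for real attention over encoder states) are made up for the example. The one essential point is that the previous step's attentional vector is concatenated with the current embedding, so the `LSTMCell` input size is `emb_dim + hidden_size`, and this per-step concatenation is exactly what a whole-sequence `nn.LSTM` call cannot do.

```python
import torch
import torch.nn as nn

emb_dim, hidden_size, batch = 8, 16, 4
# recurrent input = current embedding + previous attentional vector
cell = nn.LSTMCell(emb_dim + hidden_size, hidden_size)

def decode_step(emb_t, prev_attn_out, state):
    # input feeding: concatenate last step's attentional vector
    rnn_input = torch.cat([emb_t, prev_attn_out], dim=1)
    h, c = cell(rnn_input, state)
    # a real decoder would attend over encoder states here; this
    # tanh is only a placeholder for attention(h, encoder_memory)
    attn_out = torch.tanh(h)
    return attn_out, (h, c)

state = (torch.zeros(batch, hidden_size), torch.zeros(batch, hidden_size))
attn_out = torch.zeros(batch, hidden_size)  # no context before step 0
for t in range(3):
    emb_t = torch.randn(batch, emb_dim)
    attn_out, state = decode_step(emb_t, attn_out, state)
```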