Bahdanau's Attention

The paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation" proposes a modified RNN, which is then used in "Neural Machine Translation by Jointly Learning to Align and Translate". The modification changes how the RNN's hidden state is computed. Many people have implemented this in PyTorch, but every implementation I have seen just uses the stock RNN modules in both the encoder and the decoder, whereas the papers seem to need an internal modification. Is there any way to change the hidden-layer computation of an RNN or LSTM? If not, how can I build it from scratch in PyTorch?

Posting links here would make responding easier.

Bahdanau’s formulation doesn’t seem to require any modifications to the built in RNN.

From the equations in Section A (Appendix) of the first paper, you could possibly write an inefficient(?) implementation using some primitives. Check the following pseudocode:

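import torch
import torch.nn as nn


class EncoderRNNCell(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.hidden_size = args.hidden_size

        def _input_linear():
            # W*: maps the input x to the hidden size
            return nn.Linear(args.input_size, args.hidden_size)

        def _hidden_linear():
            # U*: maps the previous hidden state to the hidden size
            return nn.Linear(args.hidden_size, args.hidden_size)

        self.We, self.Ue = _input_linear(), _hidden_linear()
        self.Wz, self.Uz = _input_linear(), _hidden_linear()
        self.Wr, self.Ur = _input_linear(), _hidden_linear()

    def forward(self, xs):
        # xs: T x args.input_size
        T, _ = xs.size()
        # Assumption: start from an all-zero hidden state
        h = xs.new_zeros(self.hidden_size)

        hs = []
        for i in range(T):
            x = xs[i, :]
            r = torch.sigmoid(self.Wr(x) + self.Ur(h))    # reset gate
            z = torch.sigmoid(self.Wz(x) + self.Uz(h))    # update gate
            h_ = torch.tanh(self.We(x) + self.Ue(r * h))  # candidate state
            h = z * h + (1 - z) * h_                      # interpolate old and new
            hs.append(h)

        hiddens = torch.stack(hs, dim=0)  # T x args.hidden_size
        return hiddens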

You can use hiddens[-1] as h<T>.

I just found out that the model proposed in the first paper is nothing but a GRU. The problem, however, is in the decoder of the second paper: it gives the GRU an additional input, the context vector ‘c’.

Thank you for the reply, @jerinphilip. An RNN or LSTM is a combination of Linear layers; is that how it is actually implemented?

Under the hood, not exactly. There are further optimizations that parallelize the computation and take advantage of GPUs for this class of RNN models. These are implemented in the C++ backend.

But you should be able to approximate the behaviour using a composition of nn.Linear layers, I think. The built-in GRU and LSTM can serve as substitutes that could be adapted to your use case, but your query sounded like you wanted more flexibility.
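For the decoder in the second paper, where the cell takes the context vector c as a third input, such a composition could look roughly like the block below. This is only a minimal sketch, assuming the decoder update from the appendix of the second paper (gates and candidate state each computed from the previous target embedding, the previous state and the context, with the new state formed as (1 - z) * s_prev + z * s_tilde). The class name ConditionalGRUCell, its argument names and the fused projections are made up for illustration.

```python
import torch
import torch.nn as nn


class ConditionalGRUCell(nn.Module):
    """GRU-style cell with an extra context input (hypothetical name)."""

    def __init__(self, embed_size, hidden_size, context_size):
        super().__init__()
        # Fused projections: each output is split into update, reset, candidate parts.
        self.W = nn.Linear(embed_size, 3 * hidden_size)
        self.C = nn.Linear(context_size, 3 * hidden_size, bias=False)
        # The hidden-to-candidate projection is separate because it sees r * s_prev.
        self.U_gates = nn.Linear(hidden_size, 2 * hidden_size, bias=False)
        self.U_cand = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, y_emb, s_prev, c):
        # y_emb: B x embed_size, s_prev: B x hidden_size, c: B x context_size
        wz, wr, wn = self.W(y_emb).chunk(3, dim=-1)
        cz, cr, cn = self.C(c).chunk(3, dim=-1)
        uz, ur = self.U_gates(s_prev).chunk(2, dim=-1)

        z = torch.sigmoid(wz + uz + cz)  # update gate
        r = torch.sigmoid(wr + ur + cr)  # reset gate
        s_tilde = torch.tanh(wn + self.U_cand(r * s_prev) + cn)  # candidate state
        return (1 - z) * s_prev + z * s_tilde
```

Calling cell(y_emb, s_prev, c) once per target position, with c recomputed by the attention at each step, gives the decoder loop. It will be slower than the built-in modules since nothing here is fused.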

@jerinphilip, I think creating an RNN/GRU from scratch would require a lot of extra implementation work, and it would get even more complex for a bidirectional or stacked RNN.
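For comparison, the built-in nn.GRU already covers the bidirectional and stacked cases through constructor arguments. A small sketch of an encoder built that way follows; the sizes are arbitrary placeholders, and treating the concatenated forward/backward outputs as the encoder states for attention is an assumption about how one would use it here.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes: batch 4, sequence length 5, input 8, hidden 16.
encoder = nn.GRU(
    input_size=8,
    hidden_size=16,
    num_layers=2,        # stacked
    bidirectional=True,  # forward and backward passes
    batch_first=True,
)

xs = torch.randn(4, 5, 8)
outputs, h_n = encoder(xs)

# outputs: B x T x (2 * hidden_size), top layer, both directions concatenated
# h_n:     (num_layers * 2) x B x hidden_size, final state per layer and direction
print(outputs.shape, h_n.shape)
```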

I found one implementation of Bahdanau’s attention in which the author simply combines the context vector with the input embedding vector to get around the problem of a three-input RNN/GRU.
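That workaround can be sketched roughly as below. This is not the implementation referred to above, just a minimal example assuming "combine" means concatenating the context with the embedding along the feature dimension, and assuming additive scoring of the form v^T tanh(W_s s + W_h h) for the attention weights. The class name AttnDecoderStep and all argument names are hypothetical.

```python
import torch
import torch.nn as nn


class AttnDecoderStep(nn.Module):
    """One decoder step: additive attention, then a stock GRUCell (hypothetical name)."""

    def __init__(self, embed_size, hidden_size, enc_size, attn_size):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, attn_size, bias=False)
        self.W_h = nn.Linear(enc_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)
        # The cell sees [embedding; context] as a single input vector.
        self.cell = nn.GRUCell(embed_size + enc_size, hidden_size)

    def forward(self, y_emb, s_prev, enc_outputs):
        # y_emb: B x embed_size, s_prev: B x hidden_size
        # enc_outputs: B x T x enc_size
        query = self.W_s(s_prev).unsqueeze(1)                      # B x 1 x attn_size
        scores = self.v(torch.tanh(query + self.W_h(enc_outputs)))  # B x T x 1
        alpha = torch.softmax(scores, dim=1)                       # weights over T
        context = (alpha * enc_outputs).sum(dim=1)                 # B x enc_size
        s = self.cell(torch.cat([y_emb, context], dim=-1), s_prev)
        return s, alpha.squeeze(-1)
```

The trade-off is that the context reaches the gates only through the shared input projection, which is what lets the unmodified GRUCell be reused.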
