Observation: really like the seq2seq tutorial. 👍 :-)

this one: http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html


6 Likes

I agree it was a cool tutorial, but I also think it could have been better:

  1. Both the encoder and the decoder use a loop over the number of layers, which effectively forces the network to share weights across layers. This is unusual, and if it was intended, it should have been made explicit in the text.
  2. It could have dealt with attention better. It basically assumes that all sentences have the same length, and if they don't, then not all of the attention weights are used. This means the attention weights that are used may not sum to 1, as they are supposed to.
4 Likes

So that means it's not really usable as is? Does it need to be fixed/redone?

Both the encoder and the decoder use a loop over the number of layers, which effectively forces the network to share weights across layers. This is unusual, and if it was intended, it should have been made explicit in the text.

The number of layers is set to 1, so the loop could simply be removed and it would still work properly.

It could have dealt with attention better. It basically assumes that all sentences have the same length, and if they don't, then not all of the attention weights are used. This means the attention weights that are used may not sum to 1, as they are supposed to.

All the sentences do have the same length after pre-processing the data. But we are free to extend this tutorial to variable-length sentences, in order to explain the pack_padded_sequence trick and to use an RNN cell for attention (instead of the linear layer). That would be great!
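For reference, a rough sketch of what that pack_padded_sequence extension could look like (names and sizes here are made up, not from the tutorial):

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

emb_dim, hidden_size = 75, 150
gru = nn.GRU(emb_dim, hidden_size)

def encode(embedded, lengths):
    # embedded: (max_len, batch, emb_dim); lengths: true lengths, sorted in decreasing order
    packed = pack_padded_sequence(embedded, lengths)
    packed_out, hidden = gru(packed)              # the GRU skips the padded steps entirely
    outputs, _ = pad_packed_sequence(packed_out)  # back to a padded (max_len, batch, hidden_size) tensor
    return outputs, hidden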

2 Likes

Check the updated https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb for solutions to both those problems.

  1. Sharing weights was a mistake at first; I kept it because it actually converges faster (not sure if there are papers on this, but someone should write one), but it should have been explained. The updated version does the normal thing, using the n_layers argument of the nn.GRU constructor (see the sketch after this list).
  2. The attention implementation was definitely poor. The updated version covers multiple “correct” versions of attention, as seen in Effective Approaches to Attention-based Neural Machine Translation, which do not suffer from the fixed-length problem. (In fact that paper also touches upon a model called “global location” attention, which is closer to the aforementioned implementation and is shown to perform poorly.)
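For point 1, a minimal sketch of the fix (the sizes here are just an example):

import torch.nn as nn

hidden_size, n_layers = 256, 2
# one nn.GRU with n_layers stacked layers, each with its own weights,
# instead of looping over a single shared layer
gru = nn.GRU(hidden_size, hidden_size, n_layers)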

I keep meaning to merge this into the official tutorials repo…

6 Likes

@spro Let me know if you need any help merging.

The number of layers is set to 1, so the loop could simply be removed and it would still work properly.

Agreed, but only because n_layers = 1. Otherwise it would be weird.

All the sentences do have the same length after pre-processing the data.

True, but the shorter sentences were padded, so part of the attention weight goes to the padding, which is not what you want.

Thanks @spro!

Just one question: is there a good way to mask the attention? Even using the attention from the Manning paper you mentioned, if the sentences in a batch have different lengths, you will need to mask some positions. Right now I am directly filling the attention matrix (before applying the softmax) with -float("inf"), but this seems a bit hacky.
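For what it's worth, here is roughly what I am doing now; masked_fill makes the -inf trick a little tidier (the shapes are my assumptions: scores is (batch, max_len) pre-softmax, pad_mask is a ByteTensor that is 1 on padded positions):

import torch.nn.functional as F

def masked_attention(scores, pad_mask):
    # -inf scores get exactly 0 weight after the softmax,
    # so the remaining weights still sum to 1
    scores = scores.masked_fill(pad_mask, -float('inf'))
    return F.softmax(scores, dim=1)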

Just noticed: this doesn't share the embedding across the encoder input / decoder input / output? Per https://arxiv.org/abs/1608.05859 it might be interesting to do that? It is also used by the 'Attention Is All You Need' paper, https://arxiv.org/abs/1706.03762. Thoughts on how this could be implemented in a PyTorch-idiomatic way? (I can put this into a separate thread/topic/question perhaps?)

1 Like

It is certainly quite an interesting tutorial. @spro, one thing I think could be added is running the layers on CUDA when CUDA is available (USE_CUDA = True). In the current version, when CUDA is available, the Variables run on CUDA but the model/layers don't.
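Something like this at model-creation time would do it, I think (I am writing the tutorial's class names from memory, so treat this as a sketch):

encoder = EncoderRNN(input_lang.n_words, hidden_size)
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words)

if USE_CUDA:
    # move the module parameters to the GPU, not just the input Variables
    encoder = encoder.cuda()
    decoder = decoder.cuda()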

AFAIK weight sharing doesn't make sense for translation, because the embeddings on the encoder and decoder side are totally different (two different languages)… For cases like a CharRNN or WordRNN it makes sense and is very (almost too) easy to implement; see https://github.com/pytorch/examples/blob/f2a771a8a2f3a38ec15b11f6f19ac38c8bbaa900/word_language_model/model.py#L28-L31

Edit: it does make sense for the decoder-side inputs & outputs.
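A minimal sketch of what I mean for the decoder side, in the style of the linked word_language_model example (names and sizes are made up; the projection's input size has to match the embedding size):

import torch.nn as nn

vocab_size, emb_size = 10000, 300
embedding = nn.Embedding(vocab_size, emb_size)   # decoder input embedding
out_proj = nn.Linear(emb_size, vocab_size)       # pre-softmax output projection
out_proj.weight = embedding.weight               # both layers now share a single Parameter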

Well, per the paper above, French and English share enough words that it worked OK for them.

As far as copying the weights goes, I think that means the second embedding has already been allocated, and then we throw it away. That seems a bit ‘unclean’ to me. What if the embedding is huge?

I would prefer to be able to create the embedding once and then re-use it, in idiomatic PyTorch.

Couldn’t we use a shared embedding just like this?

class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        # one embedding shared by the encoder and the decoder
        self.embed_seq2seq = nn.Embedding(len(vocab), 75, padding_idx=vocab("<pad>"))
        self.lstm_enc = nn.LSTM(75, 150, 2, batch_first=True, bidirectional=False)
        self.lstm_dec = nn.LSTM(75, 150, 2, batch_first=True, bidirectional=False)
        self.linear = nn.Linear(150, 1)

And then in the forward function use self.embed_seq2seq for encoder and decoder

Yes, maybe :slight_smile:

Somewhat related question: how do you get the output of the encoder and decoder in a PyTorch-idiomatic way, handling teacher forcing etc.?

Something like this?

def forward(self, encoder_input, decoder_input, state):
    if encoder_input is not None:
        enc_input_emb = self.embed_seq2seq(encoder_input)
        enc_out, state = self.lstm_enc(enc_input_emb, state)
    if decoder_input is not None:
        decoder_input_emb = self.embed_seq2seq(decoder_input)
        dec_out, state = self.lstm_dec(decoder_input_emb, state)

    # "un-embed" by multiplying with the transposed embedding matrix;
    # note this only works if the LSTM hidden size equals the embedding size
    # (otherwise an extra projection from hidden to embedding size is needed)
    embedding_size = self.embed_seq2seq.weight.size(1)
    # with batch_first=True the decoder input is (batch, seq_len)
    batch_size = decoder_input.size(0)
    seq_len = decoder_input.size(1)
    dec_out_unemb = dec_out.contiguous().view(-1, embedding_size) @ self.embed_seq2seq.weight.transpose(0, 1)
    dec_out_unemb = dec_out_unemb.view(batch_size, seq_len, -1)
    return enc_out, dec_out, dec_out_unemb, state

?

Here you go. A pytorch implementation of the model in “Attention is All You Need”.

1 Like

Thanks for the reply. Can you also explain why attn_combine takes a cat of attn_applied and embedded? I did it with attn_applied and the previous hidden state instead of embedded, and it works quite similarly.

Hi, thank you for your recommendation.
This is a good tutorial, but I am confused at the moment about the training process, especially the attention.
I used TensorFlow before and I am new to PyTorch. I know the attention should be implemented manually instead of with a wrapper.
I am wondering how to train the model. In the tutorial, and probably in this link, https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb, @spro, it seems the RNN sequence should also be iterated over manually, as it shows something like “for t in range(max_target_length):”. So what about the batch iteration within an epoch? Is it here: ‘for iter in range(1, n_iters + 1):’?

Basically, from my understanding, training a seq2seq model in PyTorch involves two loops: 1) the batch loop within an epoch, and 2) the sequence loop within one batch, feeding the decoder word by word until the end of the sequence (maybe max_time). Is that right? I am new to PyTorch and a little bit confused. I tried an LSTM for MNIST classification as a beginner exercise, and there the LSTM could be run over the whole sequence at once instead of being fed word by word. That is why I got stuck here.
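To make it concrete, is it something like this toy sketch (all names and sizes are made up, not the tutorial code)?

import torch
import torch.nn as nn
from torch.autograd import Variable

vocab_size, emb, hid, max_len, batch = 20, 8, 16, 5, 4
embedding = nn.Embedding(vocab_size, emb)
encoder_rnn = nn.GRU(emb, hid)
decoder_rnn = nn.GRU(emb, hid)
out = nn.Linear(hid, vocab_size)
criterion = nn.CrossEntropyLoss()
params = (list(embedding.parameters()) + list(encoder_rnn.parameters())
          + list(decoder_rnn.parameters()) + list(out.parameters()))
optimizer = torch.optim.SGD(params, lr=0.1)

for step in range(2):                                   # 1) batch / iteration loop
    src = Variable((torch.rand(max_len, batch) * vocab_size).long())
    tgt = Variable((torch.rand(max_len, batch) * vocab_size).long())
    optimizer.zero_grad()
    _, hidden = encoder_rnn(embedding(src))             # encode the whole source at once
    decoder_input = tgt[0].unsqueeze(0)                 # pretend this is <SOS>
    loss = 0
    for t in range(1, max_len):                         # 2) time-step loop in the decoder
        output, hidden = decoder_rnn(embedding(decoder_input), hidden)
        loss += criterion(out(output.squeeze(0)), tgt[t])
        decoder_input = tgt[t].unsqueeze(0)             # teacher forcing
    loss.backward()
    optimizer.step()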

Could you help me with it? :smiley: @hughperkins @spro

Curious about something: it seems odd that the attention weights are calculated without looking at the encoder outputs. The relevant line is:

attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

This seems weird. Do you have an intuition for why this works? I thought you had to look at the encoder outputs when calculating the attention weights. (Your newer tutorial versions do use encoder_outputs as I would expect, by the way.)
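For reference, here is roughly the kind of attention I expected, i.e. a score computed against each encoder output (a Luong-style "general" score; the shapes are my own assumptions, not the tutorial's):

import torch.nn as nn
import torch.nn.functional as F

class GeneralAttn(nn.Module):
    # hidden: (batch, hidden_size), the current decoder hidden state
    # encoder_outputs: (max_len, batch, hidden_size)
    def __init__(self, hidden_size):
        super(GeneralAttn, self).__init__()
        self.attn = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden, encoder_outputs):
        max_len, batch_size, hidden_size = encoder_outputs.size()
        # score(h_dec, h_enc) = h_dec . (W h_enc) for every encoder position
        energy = self.attn(encoder_outputs.view(-1, hidden_size)).view(max_len, batch_size, hidden_size)
        scores = (hidden.unsqueeze(0) * energy).sum(dim=2)   # (max_len, batch)
        return F.softmax(scores.t(), dim=1)                  # (batch, max_len), rows sum to 1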