this one: http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
I agree it was a cool tutorial, but I also think it could have been better:
So that means it's not really usable as is? Does it need to be fixed/redone?
Both the encoder and the decoder use a loop over the number of layers, which effectively forces the network to share weights across layers. This is not usual, and if it was intended, it should have been made explicit in the text.
The number of layers is set to 1, so the loop could simply be removed and it would still work properly.
It could have dealt with attention better. It basically assumes that all sentences have the same length, and if they don't, then not all attention weights are used. This means the attention weights that are used may not sum to 1, as they are supposed to.
All the sentences do have the same length after pre-processing the data. But we are free to extend this tutorial to variable-length sentences, in order to explain the pack_padding trick, using an RNN cell for attention (instead of the linear layer). That would be great!
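For anyone extending it that way, here is a minimal, hedged sketch of how pack_padded_sequence is typically used on a padded batch (toy sizes, not code from the tutorial):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy setup: vocab of 10 tokens, embedding/hidden size 8, batch of 3 padded sequences.
embedding = nn.Embedding(10, 8, padding_idx=0)
gru = nn.GRU(8, 8, batch_first=True)

padded = torch.tensor([[4, 2, 7, 1],    # length 4
                       [5, 3, 0, 0],    # length 2 (0 is the pad index)
                       [6, 0, 0, 0]])   # length 1
lengths = [4, 2, 1]                     # sorted in decreasing order

packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True)
packed_out, hidden = gru(packed)        # the GRU never sees the padded positions
outputs, _ = pad_packed_sequence(packed_out, batch_first=True)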
Check the updated https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb for solutions to both those problems.
Multiple layers are handled there with the n_layers argument of the nn.GRU constructor. I keep meaning to merge this into the official tutorials repo…
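For reference, a minimal sketch of what that change amounts to, following the tutorial's one-token-at-a-time EncoderRNN (the n_layers value here is just illustrative):

import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=2):
        super(EncoderRNN, self).__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        # Instead of looping over the layers in forward() (which reuses the same
        # weights every iteration), let nn.GRU stack independent layers itself.
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=n_layers)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden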
@spro Let me know if you need any help merging.
The number of layers is set to 1, so the loop could simply be removed and it would still work properly.
Agreed, but only because n_layers = 1. Otherwise it would be weird.
All the sentences do have the same length after pre-processing the data.
True, but the shorter sentences were padded, so part of the attention goes to the padding, which is not what you want.
Thanks @spro!
Just one question: is there a good way to mask the attention? Even using the attention from the Manning paper you mentioned, if the sentences in a batch have different lengths, you will need to mask some positions. Right now I am directly filling the attention matrix (before applying the softmax) with -float('inf'), but this seems a bit hacky.
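For what it's worth, filling with -inf before the softmax is essentially the standard trick; a minimal sketch of the same idea with masked_fill (toy shapes, not code from the tutorial):

import torch
import torch.nn.functional as F

# Toy attention scores for a batch of 2 source sentences, max source length 4.
scores = torch.randn(2, 4)
lengths = torch.tensor([4, 2])                  # true source lengths in the batch

positions = torch.arange(4).unsqueeze(0)        # shape (1, 4)
pad_mask = positions >= lengths.unsqueeze(1)    # shape (2, 4), True at padded positions

# -inf scores get exactly zero weight, so the remaining weights
# still sum to 1 over the real tokens.
scores = scores.masked_fill(pad_mask, float('-inf'))
attn_weights = F.softmax(scores, dim=1)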
Just noticed: this doesn't share the embedding across encoder-input/decoder-input/output? Per https://arxiv.org/abs/1608.05859 it might be interesting to do that? Also, this is used by the "Attention Is All You Need" paper, https://arxiv.org/abs/1706.03762. Thoughts on how this would be implemented in a pytorch-idiomatic way? (I can put this into a separate/new thread/topic/question perhaps?)
It is certainly quite an interesting tutorial. @spro one thing that I think could be added is running the layers on cuda when cuda is available (USE_CUDA = True). In the current version, when cuda is available, the Variables run on cuda but the model/layers don't.
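A hedged sketch of the fix (stand-in module and tensor below; in the tutorial it would be the encoder/decoder instances and the input Variables):

import torch
import torch.nn as nn

USE_CUDA = torch.cuda.is_available()

model = nn.GRU(256, 256)          # stand-in for the tutorial's EncoderRNN / AttnDecoderRNN
data = torch.zeros(10, 1, 256)    # stand-in for an input batch

if USE_CUDA:
    model = model.cuda()          # moves the layer's parameters to the GPU
    data = data.cuda()            # inputs must live on the same device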
afaik weight sharing doesn't make sense for translation because the embeddings on the encoder and decoder side are totally different (two different languages)… for cases like CharRNN or WordRNN it makes sense and is very (almost too) easy to implement - see https://github.com/pytorch/examples/blob/f2a771a8a2f3a38ec15b11f6f19ac38c8bbaa900/word_language_model/model.py#L28-L31
Edit: it does make sense on the decoder side inputs & outputs
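For reference, roughly the tying pattern used in the linked word_language_model example (the class name and sizes here are illustrative):

import torch.nn as nn

class WordRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, tie_weights=True):
        super(WordRNN, self).__init__()
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)
        if tie_weights:
            # The output projection reuses the input embedding matrix
            # (this requires embedding size == hidden size).
            self.decoder.weight = self.encoder.weight

    def forward(self, input, hidden):
        emb = self.encoder(input)
        output, hidden = self.rnn(emb, hidden)
        return self.decoder(output), hidden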
Well, per the paper above, French and English share enough words that it worked OK for them.
As far as copying the weights goes, I think that means the second embedding has already been allocated, and then we throw it away. That seems a bit 'unclean' to me. What if the embedding is huge?
I would prefer to be able to create the embedding once and then re-use it, in idiomatic pytorch.
Couldn't we use shared embeddings just like that?
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        # One embedding table shared by the encoder and decoder inputs
        # (vocab is assumed to map tokens to indices)
        self.embed_seq2seq = nn.Embedding(len(vocab), 75, padding_idx=vocab("<pad>"))
        self.lstm_enc = nn.LSTM(75, 150, 2, batch_first=True, bidirectional=False)
        self.lstm_dec = nn.LSTM(75, 150, 2, batch_first=True, bidirectional=False)
        self.linear = nn.Linear(150, 1)
And then in the forward function, use self.embed_seq2seq for both the encoder and the decoder.
Yes, maybe
Somewhat related question: how do you get the output of the encoder and the decoder in a pytorch-idiomatic way, while handling teacher forcing etc.?
Something like this?
def forward(self, encoder_input, decoder_input, state):
    # Either input may be None, e.g. to run only the decoder during generation.
    enc_out = dec_out = dec_out_unemb = None
    if encoder_input is not None:
        enc_input_emb = self.embed_seq2seq(encoder_input)
        enc_out, state = self.lstm_enc(enc_input_emb, state)
    if decoder_input is not None:
        decoder_input_emb = self.embed_seq2seq(decoder_input)
        dec_out, state = self.lstm_dec(decoder_input_emb, state)
        # "Un-embed" by reusing the embedding matrix as the output projection;
        # this assumes the LSTM hidden size equals the embedding size.
        embedding_size = self.embed_seq2seq.weight.size(1)
        batch_size = decoder_input.size(0)   # batch_first=True: inputs are (batch, seq_len)
        seq_len = decoder_input.size(1)
        dec_out_unemb = dec_out.contiguous().view(-1, embedding_size) @ self.embed_seq2seq.weight.t()
        dec_out_unemb = dec_out_unemb.view(batch_size, seq_len, -1)
    return enc_out, dec_out, dec_out_unemb, state
?
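One hypothetical way to drive that forward() with full teacher forcing (src, tgt and the loss below are assumptions, not code from the thread, and the tied un-embedding only lines up if the embedding and LSTM hidden sizes are set equal):

import torch.nn.functional as F

model = Seq2Seq()
state = None                      # nn.LSTM uses zero initial states when state is None

# Hypothetical batch-first index tensors: src is (batch, src_len) and tgt is
# (batch, tgt_len), where tgt starts with <sos> and ends with <eos>.
dec_in = tgt[:, :-1]              # <sos> w1 w2 ...
dec_target = tgt[:, 1:]           # w1 w2 ... <eos>

enc_out, dec_out, logits, state = model(src, dec_in, state)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), dec_target.reshape(-1))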
Here you go. A pytorch implementation of the model in "Attention is All You Need".
Thanks for the reply. Can you also explain why attn_combine takes a cat of attn_applied and embedded? I did it with attn_applied and the previous hidden state instead of embedded, and it works quite similarly.
Hi, thank you for your recommendation.
This is a good tutorial but I am confused at the moment about the training process, especially the attention.
I used Tensorflow before and I am new to Pytorch. I know the attention should be implemented manually rather than with a wrapper.
I am wondering how to train the model. In the tutorial, and probably in this link, https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb, @spro it seems the rnn sequence also has to be iterated manually, as it shows something like "for t in range(max_target_length):". So what about the batch iteration in an epoch? Is it here, "for iter in range(1, n_iters + 1):"?
Basically, from my understanding, training a seq2seq model in Pytorch involves two loops: 1) the batch loop within an epoch, and 2) the sequence loop within one batch, feeding word by word until the end of a sequence (maybe max_time). Is that right? I am new to pytorch and a little bit confused. I tried an lstm for mnist classification as a beginner program, and there the lstm could be fed a whole sequence at once instead of word by word. That is why I got stuck here.
Could you help me with it? @hughperkins @spro
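For what it's worth, that reading matches the tutorial's structure. A much-simplified sketch of the two levels (names follow the tutorial; the loss, optimizers and gradient steps are omitted here):

import random
import torch

teacher_forcing_ratio = 0.5

for it in range(1, n_iters + 1):                    # outer loop: one (source, target) pair per iteration
    input_tensor, target_tensor = random.choice(training_pairs)

    encoder_hidden = encoder.initHidden()
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size)
    for ei in range(input_tensor.size(0)):          # inner loop: feed the source word by word
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]])
    decoder_hidden = encoder_hidden
    for di in range(target_tensor.size(0)):         # inner loop: decode word by word
        decoder_output, decoder_hidden, attn = decoder(decoder_input, decoder_hidden, encoder_outputs)
        if random.random() < teacher_forcing_ratio:
            decoder_input = target_tensor[di]       # teacher forcing: feed the ground-truth token
        else:
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach() # feed back the model's own prediction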
Curious about something: it seems odd that the attention weights are calculated without looking at encoder_output. Relevant line is:
attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
This seems weird; do you have an intuition for why this would work? I thought you had to look at the encoder when calculating the weights (your newer tutorial versions do use the encoder_outputs as I would expect, btw)
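For comparison, a minimal sketch of a content-based score that does look at the encoder outputs (a Luong-style dot product; the function and shapes here are assumptions, not the tutorial's code):

import torch
import torch.nn.functional as F

# decoder_hidden: (1, 1, hidden_size), encoder_outputs: (max_len, hidden_size)
def content_attention(decoder_hidden, encoder_outputs):
    scores = encoder_outputs @ decoder_hidden[0, 0]         # (max_len,) dot-product scores
    attn_weights = F.softmax(scores, dim=0)                 # sums to 1 over source positions
    context = attn_weights.unsqueeze(0) @ encoder_outputs   # (1, hidden_size) weighted sum
    return attn_weights, context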