How to implement an attention mask / unidirectional attention in TransformerDecoder

Hi guys,

I’m learning about nn.Transformer in PyTorch these days and I’m a bit confused about how the attention mask is implemented in the decoder. Say we’re doing a machine translation task with a Transformer: at inference time, the output at each time step can only “see” the tokens before it. However, at training time we simply feed the whole correct sequence to the decoder, so a token in the decoded sequence can see the tokens both before and after it, which I guess is not suitable for a robust model.

I know that GPT adopts unidirectional attention when decoding (and I guess that’s appropriate for a decoder), but I’m wondering whether the APIs in nn.Transformer can provide such a feature (e.g. using attention masks to prevent each token from attending to later tokens)? If so, what should I do?

Any help would be appreciated, thanks in advance!

You should set up an attention mask for the decoder so that each token is masked from seeing the tokens after it.

Thanks for the reply! Could you please offer more concrete hints, e.g. how to set up the attention masks, some sample code, or a link that covers these? Many thanks!
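
Sure. Here is a minimal sketch (the model sizes, layer counts, and dummy tensors below are just placeholders, not from any real setup) showing how to build a causal mask with torch.triu and pass it as tgt_mask to nn.TransformerDecoder. The built-in helper nn.Transformer.generate_square_subsequent_mask produces the same kind of mask, so you can use that instead of the hand-rolled function.

```python
import torch
import torch.nn as nn

def generate_causal_mask(sz: int) -> torch.Tensor:
    # Additive mask: positions that must NOT be attended to get -inf,
    # allowed positions get 0. Row i can attend to columns 0..i only,
    # i.e. no token can look at later ("future") tokens.
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

# Placeholder model and data just to show where the mask goes.
d_model, nhead = 512, 8
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

tgt_len, src_len, batch = 10, 12, 4
tgt = torch.rand(tgt_len, batch, d_model)     # decoder input (shifted target)
memory = torch.rand(src_len, batch, d_model)  # encoder output

tgt_mask = generate_causal_mask(tgt_len)      # shape (tgt_len, tgt_len)
out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([10, 4, 512])
```

With tgt_mask passed this way, the self-attention inside every decoder layer is unidirectional during training as well, so the training setup matches what happens at inference time when you decode token by token.
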

Many thanks for the code, it’s much clearer now!