Hello Christos,
With
attentions = Variable(torch.randn(5,10).cuda())
max_len = attentions.size(1)
lengths = ((torch.arange(0,5)+5).long().cuda())
as an example, I think you can do this faster if you do
idxes = torch.arange(0,max_len,out=torch.LongTensor(max_len)).unsqueeze(0).cuda() # some day, you'll be able to directly do this on cuda
mask = Variable((idxes<lengths.unsqueeze(1)).float())
(works on master / 0.2; on 0.1.12 you need to expand_as(attentions) or so)
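For reference, here is the same broadcasting comparison sketched in NumPy, just to make the shapes explicit (the concrete numbers mirror the example above; this is an illustration, not the torch code itself):

```python
import numpy as np

# Each row of the mask is 1 up to that sequence's length and 0 afterwards.
lengths = np.array([5, 6, 7, 8, 9])          # like (torch.arange(0, 5) + 5)
max_len = 10
idxes = np.arange(max_len)[None, :]          # shape (1, max_len)

# Broadcasting compares every position index against every length at once.
mask = (idxes < lengths[:, None]).astype(np.float32)  # shape (5, max_len)

print(mask.shape)        # (5, 10)
print(mask.sum(axis=1))  # row sums equal the lengths: [5. 6. 7. 8. 9.]
```

The comparison broadcasts a (1, max_len) index row against a (batch, 1) length column, so no explicit loop over the batch is needed.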
If you multiply the output by the mask, the zeroed entries will also propagate a gradient of 0 to attention. I think this is the right way in general, but I don’t have the expertise to say how best to handle the end of the sequence…
Best regards
Thomas