Self-Attention (on words) and masking

Hello Christos,

With

import torch
from torch.autograd import Variable

attentions = Variable(torch.randn(5, 10).cuda())   # batch of 5 score rows, max length 10
max_len = attentions.size(1)
lengths = (torch.arange(0, 5) + 5).long().cuda()   # per-example lengths: 5, 6, 7, 8, 9

as an example, I think you can do this faster if you do

idxes = torch.arange(0, max_len, out=torch.LongTensor(max_len)).unsqueeze(0).cuda()  # positions 0..max_len-1, shape [1, max_len]; some day, you'll be able to directly do this on cuda
mask = Variable((idxes < lengths.unsqueeze(1)).float())  # mask[i, t] = 1 while t < lengths[i], else 0

(This works on master / 0.2; on 0.1.12 you need expand_as(attentions) or similar, see the sketch just below.)
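On 0.1.12, without broadcasting, I would expect something along these lines to do the job (an untested sketch with my own names idxes_exp / lengths_exp, using expand with explicit sizes instead of expand_as, and the same attentions / lengths as above):

idxes_exp = idxes.expand(5, max_len)                    # [5, 10]
lengths_exp = lengths.unsqueeze(1).expand(5, max_len)   # [5, 10]
mask = Variable((idxes_exp < lengths_exp).float())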

If you multiply the outputs by 0 with the mask, a gradient of 0 will be propagated to the masked positions of attention. I think this is the right way in general, but I don’t have the expertise to say how best to handle the end of the sequence…
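For example, one way to use the mask would be to zero the padded positions after exponentiating the scores and then renormalize each row (just a sketch on 0.2 with my own names masked / weights; in practice you would also subtract the row max before exp for numerical stability):

masked = attentions.exp() * mask                  # padded positions get weight 0 (and gradient 0)
weights = masked / masked.sum(1, keepdim=True)    # each row sums to 1 over the real tokens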

Best regards

Thomas
