Hello Christos,
With
attentions = Variable(torch.randn(5,10).cuda())
max_len = attentions.size(1)
lengths = ((torch.arange(0,5)+5).long().cuda())
as an example, I think you can do this faster if you do
idxes = torch.arange(0,max_len,out=torch.LongTensor(max_len)).unsqueeze(0).cuda() # some day, you'll be able to directly do this on cuda
mask = Variable((idxes<lengths.unsqueeze(1)).float())
(works on master / 0.2; on 0.1.12 you need to expand_as(attentions) or so)
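For reference, here is the same broadcasting comparison sketched in NumPy, just to make the shapes explicit (the concrete numbers mirror the example above; this is an illustration, not the torch code itself):

```python
import numpy as np

# Each row of the mask is 1 up to that sequence's length and 0 afterwards.
lengths = np.array([5, 6, 7, 8, 9])          # like (torch.arange(0, 5) + 5)
max_len = 10
idxes = np.arange(max_len)[None, :]          # shape (1, max_len)

# Broadcasting compares every position index against every length at once.
mask = (idxes < lengths[:, None]).astype(np.float32)  # shape (5, max_len)

print(mask.shape)        # (5, 10)
print(mask.sum(axis=1))  # row sums equal the lengths: [5. 6. 7. 8. 9.]
```

The comparison broadcasts a (1, max_len) index row against a (batch, 1) length column, so no explicit loop over the batch is needed.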
If you multiply the output by the mask, the zeroed entries will also propagate a gradient of 0 to attention. I think this is the right way in general, but I don’t have the expertise to say how best to handle the end of the sequence…
Best regards
Thomas