Apply mask softmax

Hi everyone,

I am trying to implement the following function:

At this stage I have, for example, a tensor [[1, 0, 3], [0, 1, 2], [3, 2, 1]], and the softmax should be computed only over the values greater than 0 (the neighbors), giving [[0.12, 0, 0.88], [0, 0.27, 0.73], [0.67, 0.24, 0.09]].

Is there an existing method to do this, or does anyone have an idea?

Thank you for your help!

To clarify: you want to calculate the standard softmax, but you want to ignore any zero values.

In other words, for the first row you remove the zero, then you calculate softmax([1,3]), and then you reinsert the zero into the output.

Is that right?

If so, I am stumped. Maybe someone else can help.

Yes, exactly! The zeros were obtained using torch.where.

I thought about using -10^10 instead of 0 for the softmax, but I guess there should be a better way.
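The large-negative-fill idea can be sketched as follows. This is only a minimal illustration, assuming that zeros mark the entries to ignore; the name NEG_FILL is introduced here for the example and is not from the thread.

```python
import torch

# Example tensor from the question; zeros mark entries to be ignored.
A = torch.tensor([[1., 0., 3.], [0., 1., 2.], [3., 2., 1.]])

NEG_FILL = -1e10  # hypothetical constant: large negative stand-in for "masked"

# Replace the zeros with the large negative value, then apply the ordinary softmax.
filled = torch.where(A > 0, A, torch.full_like(A, NEG_FILL))
result = torch.softmax(filled, dim=1)
print(result)
```

Because exp of such a large negative number underflows to 0 in float32, the masked positions end up with probability 0 and the remaining entries of each row sum to 1.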


I think that is a great idea. Anything else would require funny indexing and would probably end up being super inefficient.

Hi Diego. Does this approach work well?

Hi Andy-jqa

Yes, absolutely!

Thanks for the reply.
I did something like this:

# matrix A is the one you want to do mask softmax at dim=1
A_max = torch.max(A,dim=1,keepdim=True)[0]
A_exp = torch.exp(A-A_max)
A_exp = A_exp * (A != 0).type(torch.FloatTensor) # this step masks: keep non-zero entries, zero out the rest
A_softmax = A_exp / torch.sum(A_exp,dim=1,keepdim=True)

which also seems to work.


Instead of .type(torch.FloatTensor), which will move the data to the CPU if it was on the GPU, I would use .float(). It converts the tensor while leaving it on the device it was already on, and it is shorter to type and easier to read.
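A minimal sketch of the snippet above with .float() in place of .type(torch.FloatTensor). The device selection line is an assumption added for illustration, and the mask here keeps the non-zero entries.

```python
import torch

A = torch.tensor([[1., 0., 3.], [0., 1., 2.], [3., 2., 1.]])

# Use a GPU if one is available; otherwise everything stays on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A = A.to(device)

# .float() converts the boolean mask to float32 on the same device as A.
mask = (A != 0).float()

A_max = torch.max(A, dim=1, keepdim=True)[0]
A_exp = torch.exp(A - A_max) * mask  # masked entries contribute 0
A_softmax = A_exp / torch.sum(A_exp, dim=1, keepdim=True)
print(mask.device, A_softmax)
```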

Great advice! Thanks.

In case you cannot make any assumptions about your data, you could use float('-inf')


Thanks a lot!
And I have a little suggestion:
A_softmax = A_exp /(torch.sum(A_exp,dim=1,keepdim=True)+epsilon)
It avoids division by zero.

I had to implement something similar. My approach was the following (where mask is a tensor of 1s and 0s indicating the entries to be removed):

def masked_softmax(vec, mask, dim=1):
    masked_vec = vec * mask.float()
    max_vec = torch.max(masked_vec, dim=dim, keepdim=True)[0]
    exps = torch.exp(masked_vec-max_vec)
    masked_exps = exps * mask.float()
    masked_sums = masked_exps.sum(dim, keepdim=True)
    zeros=(masked_sums == 0)
    masked_sums += zeros.float()
    return masked_exps/masked_sums

Thanks for your solution. But why do you need to compute zeros and add them to the masked_sums? Without zeros, pytorch can still compute the probability by broadcasting the masked_sums.
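One case where the zeros term seems to matter is a row whose entries are all masked: the sum is then 0 and the division becomes 0 / 0. The sketch below uses made-up values for vec and mask to show the difference; it is an illustration, not the original poster's explanation.

```python
import torch

vec = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
mask = torch.tensor([[1., 0., 1.], [0., 0., 0.]])  # second row is fully masked

masked_exps = torch.exp(vec - vec.max(dim=1, keepdim=True)[0]) * mask
masked_sums = masked_exps.sum(dim=1, keepdim=True)

# Without the guard, the fully masked row computes 0 / 0, which is nan.
unguarded = masked_exps / masked_sums

# With the guard, zero sums are replaced by 1, so the row becomes 0 / 1 = 0.
guarded_sums = masked_sums + (masked_sums == 0).float()
guarded = masked_exps / guarded_sums
print(unguarded)
print(guarded)
```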

I wrote an even simpler version:

def masked_softmax(vec, mask, dim=1, epsilon=1e-5):
    exps = torch.exp(vec)
    masked_exps = exps * mask.float()
    masked_sums = masked_exps.sum(dim, keepdim=True) + epsilon
    return (masked_exps/masked_sums)
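One caveat with dropping the max subtraction: torch.exp can overflow in float32 for large inputs. The sketch below compares both variants on made-up values chosen to trigger the overflow; it is an illustration under that assumption rather than a statement about typical data.

```python
import torch

vec = torch.tensor([[100., 1., 99.]])
mask = torch.tensor([[1., 0., 1.]])
epsilon = 1e-5

# Without max subtraction: exp(100) overflows float32 to inf, and inf / inf is nan.
naive_exps = torch.exp(vec) * mask
naive = naive_exps / (naive_exps.sum(dim=1, keepdim=True) + epsilon)

# With max subtraction the largest term is exp(0) = 1, so nothing overflows.
shifted_exps = torch.exp(vec - vec.max(dim=1, keepdim=True)[0]) * mask
stable = shifted_exps / (shifted_exps.sum(dim=1, keepdim=True) + epsilon)
print(naive)
print(stable)
```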

Hmm… directly feeding vectors with float('-inf') to PyTorch's default nn.Softmax can in fact return nan at the entries with negative infinity.
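For what it is worth, a quick check suggests the nan shows up when every entry along the softmax dimension is -inf, while a row that is only partially masked gets plain zeros at the masked positions. The tensor below is made up for illustration.

```python
import torch
import torch.nn.functional as F

neg_inf = float('-inf')
x = torch.tensor([[1., neg_inf, 3.],             # partially masked row
                  [neg_inf, neg_inf, neg_inf]])  # fully masked row

out = F.softmax(x, dim=1)
# Row 0: the -inf entry gets probability 0 and the other entries sum to 1.
# Row 1: every entry is -inf, so the whole row comes out as nan.
print(out)
```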


There is no need to add the epsilon as long as each row keeps at least one unmasked entry, since exp(x) is always larger than 0.

Try this

tsr = torch.Tensor([[1,0,3], [0, 1, 2], [3, 2, 1]]).float()
mask = ((tsr > 0).float() - 1) * 9999  # 0 where kept, -9999 where masked (stand-in for -inf)
result = (tsr + mask).softmax(dim=-1)

Here is a solution that fills the masked places with float('-inf'):

import torch.nn.functional as F
F.softmax(vec.masked_fill((1 - mask).bool(), float('-inf')), dim=1)

What is the logic behind skipping zeroes? Does it improve predictions?

