Proper way to mask softmax/log_softmax output

Hello, everyone!

I want to ask: “How do we mask the softmax output of a neural network?”

In some cases, such as reinforcement learning, only a constrained subset of actions is allowed, and we sample the action from the softmax/log_softmax output. So we need to mask out the actions that cannot happen.

When I multiply the softmax output by a mask tensor like [0., 0., 0., 1., 0., 1.] (a FloatTensor), the result sometimes becomes nan/-inf. Furthermore, it can cause a runtime error: cuda runtime error (59): device-side assert triggered.
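A minimal sketch of what I am doing (the logit values are made up, but chosen so the allowed actions get vanishingly small probability, which reproduces the failure):

```python
import torch

logits = torch.tensor([0., 0., 0., -120., 0., -120.])
mask = torch.tensor([0., 0., 0., 1., 0., 1.])

probs = torch.softmax(logits, dim=0) * mask  # exp(-120) underflows to 0
probs = probs / probs.sum()                  # 0 / 0 -> nan everywhere
action = torch.multinomial(probs, 1)         # nan probabilities trigger the
                                             # device-side assert on CUDA
```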

How do we mask the softmax/log_softmax output appropriately?

Thank you.


Typical ways include boolean indexing with the 1/0 array, using where (in master), or just clamp’ing the infs (which won’t help with NaN).
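For example (a rough sketch; `logits` and `mask` are placeholder names, with the allowed entries deliberately given tiny logits):

```python
import torch

logits = torch.tensor([0., 0., 0., -120., 0., -120.])
mask = torch.tensor([0, 0, 0, 1, 0, 1], dtype=torch.bool)

# where: replace disallowed logits with -inf *before* the softmax
neg_inf = torch.full_like(logits, float('-inf'))
probs = torch.softmax(torch.where(mask, logits, neg_inf), dim=0)

# boolean indexing: softmax over the allowed entries only, scatter back
probs2 = torch.zeros_like(logits)
probs2[mask] = torch.softmax(logits[mask], dim=0)

# clamp: bound the -inf log-probabilities (does not repair NaN)
logp = torch.log_softmax(torch.where(mask, logits, neg_inf), dim=0).clamp(min=-1e4)
```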

Best regards

Thomas

I have the same issue: masking the output of softmax sometimes produces nan/-inf.
I found out this happens because the masked probabilities are too small for the float type.
The outputs of my NN are around -30; e^(-30) is on the order of 1e-13, so after masking, the normalizing sum can underflow to nearly 0, which produces inf and nan.

Here’s one implementation I found to be useful.
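In case that link goes stale, the common trick such implementations rely on (this sketch is my own, so the function name and the -1e9 constant are illustrative choices, not necessarily the linked code) is to push disallowed logits to a very large negative value before the softmax, instead of multiplying the probabilities afterwards:

```python
import torch

def masked_softmax(logits, mask, dim=-1):
    # `mask` holds 1 for allowed entries and 0 for disallowed ones.
    mask = mask.to(dtype=logits.dtype)
    # Shift disallowed logits far down instead of zeroing probabilities;
    # softmax subtracts the row max internally, so logits around -30
    # no longer underflow.
    masked_logits = logits + (1.0 - mask) * -1e9
    return torch.softmax(masked_logits, dim=dim)

# Example: logits around -30 with two disallowed actions.
probs = masked_softmax(torch.tensor([-30., -31., -29., -30.]),
                       torch.tensor([0., 1., 0., 1.]))
print(probs)  # disallowed entries come out as exactly 0
```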
