Any built-in function to avoid applying an expensive softmax to padding tokens in a sequence?

Zheng_Chen · February 25, 2020, 6:06pm

This is not about how to get a distribution (like attention weights) over a sequence that excludes the paddings, so that the normalization is done only over the non-padding part of the sequence.

My problem is this: suppose you have a sequence of vectors (embeddings)
$x_1$, $x_2$, $x_3$, …, $x_n$, $p_1$, $p_2$, …, $p_m$

and I need to generate a probability distribution by softmax for each of the non-padding ones $x_1$, $x_2$, $x_3$, …, $x_n$. I don’t need distributions over the paddings $p_1$, …, $p_m$ (usually they are all zero vectors).

Of course, I can generate the distributions for all positions and then discard the last $m$ distributions.
But if each distribution is large, like a distribution over a vocabulary of 30000+ entries, then softmax is very slow.
Also, the sequence lengths vary a lot, so there is a lot of padding in a batch.

Is there a built-in way to avoid computing softmax on the useless paddings?
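
To make the question concrete, here is a minimal sketch of one possible workaround, assuming PyTorch (the post does not name a framework) and assuming the per-sequence lengths are known. It is not a dedicated built-in: it builds a boolean mask from the lengths and selects the non-padding positions before the vocabulary projection and softmax, so neither is computed for padding. The function name `softmax_on_non_padding` and the arguments `hidden`, `lengths`, and `proj` are hypothetical names chosen for illustration.

```python
import torch
import torch.nn.functional as F


def softmax_on_non_padding(hidden, lengths, proj):
    """Compute softmax distributions only for non-padding positions.

    hidden:  (batch, max_len, d) padded sequence of vectors (hypothetical name)
    lengths: (batch,) number of real tokens in each sequence
    proj:    module mapping d -> vocab size, e.g. torch.nn.Linear
    Returns (probs, mask): probs has shape (num_real_tokens, vocab),
    mask has shape (batch, max_len) and is True at non-padding positions.
    """
    max_len = hidden.size(1)
    # Position index j is a real token in sequence i iff j < lengths[i].
    positions = torch.arange(max_len, device=lengths.device)
    mask = positions.unsqueeze(0) < lengths.unsqueeze(1)
    # Boolean indexing flattens the selected positions to (num_real_tokens, d),
    # so the projection and softmax below never see the padding vectors.
    real = hidden[mask]
    probs = F.softmax(proj(real), dim=-1)
    return probs, mask


if __name__ == "__main__":
    hidden = torch.randn(2, 4, 8)       # batch of 2, padded to length 4
    lengths = torch.tensor([3, 1])      # 4 real tokens in total
    proj = torch.nn.Linear(8, 50)       # toy vocabulary of 50 entries
    probs, mask = softmax_on_non_padding(hidden, lengths, proj)
    print(probs.shape)                  # torch.Size([4, 50])
```

The result is a flat tensor with one row per real token, in row-major order of the mask. If the padded layout is needed afterwards, the rows can be written back with something like `out[mask] = probs` into a zero tensor of shape `(batch, max_len, vocab)`.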