This is **not** about computing a distribution (e.g., attention weights) over a sequence while excluding the padding, so that normalization runs only over the non-padding part of the sequence.

**My problem is this**: suppose you have a vector (embedding) sequence

$x_1$, $x_2$, $x_3$, …, $x_n$, $p_1$, $p_2$, …, $p_m$

and I need to generate a probability distribution via softmax for each of the non-padding positions $x_1$, $x_2$, $x_3$, …, $x_n$. I don't need distributions over the paddings $p_1$, …, $p_m$ (usually they are all-zero vectors).

Of course, I can generate the distributions for all positions and then discard the last $m$ of them.

But if each distribution is large (e.g., a vocabulary distribution with 30,000+ entries), the softmax becomes very expensive.

Moreover, the sequence lengths vary a lot, so a batch contains a lot of padding.

**Is there a built-in way to avoid computing softmax on the useless paddings?**
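For context, one common workaround (not a built-in flag) is to build a boolean mask from the sequence lengths, gather only the non-padding positions into a flat `(num_real_tokens, hidden)` matrix, and apply the output projection and softmax to that matrix alone. A minimal NumPy sketch of the idea, with hypothetical sizes (`batch=2`, `seq_len=5`, `hidden=4`, `vocab=10` are made up for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

batch, seq_len, hidden, vocab = 2, 5, 4, 10
rng = np.random.default_rng(0)
h = rng.standard_normal((batch, seq_len, hidden))  # hidden states per position
W = rng.standard_normal((hidden, vocab))           # output projection to vocab

# mask[b, t] is True for real tokens, False for padding
lengths = np.array([5, 3])                          # real lengths per sequence
mask = np.arange(seq_len)[None, :] < lengths[:, None]  # shape (batch, seq_len)

# gather only the non-padding positions, then project + softmax
h_flat = h[mask]                  # (num_real_tokens, hidden); here 5 + 3 = 8 rows
probs = softmax(h_flat @ W)       # (num_real_tokens, vocab); padding never touched
```

The same boolean-indexing pattern works in most tensor frameworks; the softmax cost then scales with the number of real tokens rather than `batch * seq_len`.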