I don’t know if it’s “normal,” but this kind of thing has been seen before.
See, for example:
It might be worth noting that because you are adding a trailing singleton
dimension (unsqueeze(-1)) to tensor1, you are, in effect, performing
a batch of vector dot products rather than a batch of fully general matrix
multiplications.
Computing a batch of dot products is not a rare use case, but pytorch
does not offer a specialized batch-dot-product function. I’ve come to the
conclusion that einsum() is a perfectly satisfactory way to compute a
batch dot product (and it’s what I use by default when the need arises).
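As a sketch (with made-up shapes, since your actual sizes aren’t shown), the einsum() version of a batch dot product looks like this:

```python
import torch

# hypothetical batch of 32 pairs of length-10 vectors
tensor1 = torch.randn(32, 10)
tensor2 = torch.randn(32, 10)

# one dot product per batch element: sum over i, keep the batch index b
dots = torch.einsum('bi,bi->b', tensor1, tensor2)
print(dots.shape)  # torch.Size([32])
```

The 'bi,bi->b' equation says it all: multiply elementwise over the shared index i and sum it out, leaving one scalar per batch element b.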
(It’s worth noting that there are instances where einsum() – perhaps with
older versions of pytorch – unreasonably underperforms the equivalent matmul() computation (with various transpose()s and unsqueeze()s
to get the dimensions to line up correctly).)
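For comparison, here is what I mean by the equivalent matmul() computation, again with hypothetical shapes: unsqueeze each batch of vectors into a batch of (1 x n) and (n x 1) matrices, matmul() them, and squeeze the singleton dimensions back out.

```python
import torch

tensor1 = torch.randn(32, 10)
tensor2 = torch.randn(32, 10)

# batch of (1 x 10) @ (10 x 1) matrix products -> shape (32, 1, 1)
dots_mm = torch.matmul(tensor2.unsqueeze(1), tensor1.unsqueeze(-1))

# squeeze away the two singleton dimensions -> shape (32,)
dots_mm = dots_mm.squeeze(-1).squeeze(-1)
print(dots_mm.shape)  # torch.Size([32])
```

Mathematically this is the same batch dot product as the einsum() version; the unsqueeze()s are only there to get the dimensions to line up for matmul().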
Perhaps matmul()'s performance tuning has been focused on full matrix
products, rather than the “edge” case of batch dot products. This would
hardly excuse matmul()'s underperformance, but might offer a historical
explanation.
Or it might be some glitch in matmul()'s broadcasting support. It might
be interesting to perform the comparison when creating tensor1 with
an explicit trailing singleton dimension, rather than using unsqueeze().
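A minimal timing sketch of that comparison might look like the following (the shapes are made up, since your actual sizes aren’t shown, and timeit() here is just a quick-and-dirty helper):

```python
import time
import torch

# hypothetical sizes
nbatch, n = 10000, 128
tensor2 = torch.randn(nbatch, 1, n)

# version A: trailing singleton dimension added to tensor1 with unsqueeze()
t1_unsqueezed = torch.randn(nbatch, n).unsqueeze(-1)

# version B: tensor1 created with the trailing singleton dimension from the start
t1_explicit = torch.randn(nbatch, n, 1)

def timeit(fn, reps=100):
    # crude wall-clock timing helper
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return time.perf_counter() - start

# batch of (1 x n) @ (n x 1) matrix products, i.e., batch dot products
t_a = timeit(lambda: torch.matmul(tensor2, t1_unsqueezed))
t_b = timeit(lambda: torch.matmul(tensor2, t1_explicit))
print(f'unsqueeze(): {t_a:.4f} s   explicit singleton: {t_b:.4f} s')
```

Note that unsqueeze(-1) on a contiguous tensor yields a contiguous view, so I would expect the two versions to perform identically; if they don’t, that would point toward some glitch in how matmul() handles the unsqueezed input.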
(You could also try adding a leading singleton dimension to tensor2.
You would, of course, still be broadcasting bs over tensor2’s singleton
dimension, and I don’t think it would be a fair comparison to avoid such
broadcasting.)