# Why does vectorization provide little to no speedup for matrix_exp?

I’m trying to do some optimizations. My understanding is that vectorized (batched) operations are usually faster than Python loops, but that doesn’t seem to be the case for `matrix_exp`. Here are some tests:

``````
matrices = [torch.rand(1000, 1000) for _ in range(16)]
``````

Test using loops:

``````
%%timeit
for m in matrices:
    e = torch.matrix_exp(m)
2.24 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

Test with vectorization:

``````
%%timeit
torch.matrix_exp(torch.stack(matrices))
2.21 s ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

I would love to understand why there’s very little speed-up in this case, and if there’s anything else I can do.

So I don’t know how `matrix_exp` is implemented (it could just be looping over the items internally), but one general thought that applies to 1000x1000 matrices: vectorization works best when the individual operations either cannot use all computational resources (i.e. fully saturate the CPU/GPU) or when the “administrative overhead” (e.g. creating Tensor data structures with metadata) accounts for a substantial fraction of the computation. I doubt either is the case here: a single 1000x1000 `matrix_exp` is already large enough to keep the hardware busy, so batching has little left to gain.
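To make the overhead point concrete: with many *small* matrices, a single batched call does beat a Python loop, because the per-call dispatch overhead is paid once instead of N times. A minimal sketch using NumPy’s batched `matmul` as a stand-in (the effect is about call overhead, not `matrix_exp` specifically):

``````python
import numpy as np

# Many small matrices: per-call overhead dominates the actual math.
rng = np.random.default_rng(0)
mats = rng.standard_normal((512, 8, 8))

# Loop: 512 separate matmul calls, each with its own dispatch overhead.
looped = np.stack([m @ m for m in mats])

# Batched: one call; matmul broadcasts over the leading batch axis.
batched = mats @ mats

# Both compute the same thing; timing them (e.g. with %timeit) shows
# the batched call pulling ahead as the matrices shrink and N grows.
assert np.allclose(looped, batched)
``````

With 1000x1000 matrices the opposite regime applies: each call is expensive enough that the loop overhead is lost in the noise, which matches the timings above.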

Best regards

Thomas

Late to the party here, but yeah, there’s a part of the implementation that’s basically sequential: pytorch/LinearAlgebra.cpp at 7f18ef14c1fed4e4376a75d626d98ba3c074809c · pytorch/pytorch · GitHub
We could do better there by choosing the largest number of matmuls required by any matrix in the batch and performing that many for the whole batch. I suspect this would work well in real-world scenarios.
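A rough sketch of that batching idea in NumPy, not the actual ATen code (the 1-norm heuristic, Taylor order, and term count here are illustrative assumptions, not PyTorch’s algorithm): pick one scaling exponent `s`, the maximum over the batch, so every matrix performs the same number of squarings and every matmul stays batched.

``````python
import numpy as np

def batched_expm(A, taylor_terms=18):
    """Sketch of scaling-and-squaring with one squaring count per batch.

    A: array of shape (batch, n, n). Hypothetical illustration only:
    instead of a per-matrix squaring count (which forces a sequential
    loop over the batch), take the max over the whole batch.
    """
    # Per-matrix 1-norms (max absolute column sum), then one shared
    # scaling exponent s = max over the batch.
    norms = np.abs(A).sum(axis=1).max(axis=1)
    s = max(0, int(np.ceil(np.log2(max(norms.max(), 1e-16)))) + 1)
    X = A / (2.0 ** s)  # scale every matrix by the same factor

    # Batched Taylor series for exp(X): I + X + X^2/2! + ...
    n = A.shape[-1]
    E = np.broadcast_to(np.eye(n), A.shape).copy()
    term = np.broadcast_to(np.eye(n), A.shape).copy()
    for k in range(1, taylor_terms + 1):
        term = term @ X / k  # batched matmul over the leading axis
        E = E + term

    # Undo the scaling: square s times, again fully batched.
    for _ in range(s):
        E = E @ E
    return E
``````

The trade-off is that matrices needing few squarings do extra matmuls, but every operation runs over the whole batch at once, which is usually the better deal on parallel hardware.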