Why does vectorization provide little to no speedup for matrix_exp?

I’m trying to do some optimizations. My understanding is that, usually, it’s better to vectorize than to do loops. But this doesn’t seem to be the case for matrix_exp. Here are some tests:

import torch

matrices = [torch.rand(1000, 1000) for _ in range(16)]

Test using loops:

for m in matrices:
    e = torch.matrix_exp(m)

2.24 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test with vectorization:

2.21 s ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I would love to understand why there’s so little speedup in this case, and whether there’s anything else I can do.

I don’t know how matrix_exp is implemented (it could simply be looping over the items), but here is one general thought that applies to 1000x1000 matrices: vectorization works best when the individual operations either cannot use all computational facilities (i.e. they do not fully use the CPU/GPU) or when the “administrative overhead” (e.g. the creation of Tensor data structures with metadata) makes up a substantial share of the total runtime. I doubt that either is the case here.
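If you want to check this on your own machine, here is a rough sketch that times the loop against a single batched call for a given matrix size. It assumes the vectorized test was something like torch.matrix_exp(torch.stack(matrices)), which the post does not show; the helper names (time_loop, time_batched, compare) are made up for illustration, and the sizes in the main block are arbitrary.

```python
import time

import torch


def time_loop(matrices):
    """Time one torch.matrix_exp call per matrix."""
    start = time.perf_counter()
    out = [torch.matrix_exp(m) for m in matrices]
    elapsed = time.perf_counter() - start
    return elapsed, torch.stack(out)


def time_batched(matrices):
    """Time a single torch.matrix_exp call on the stacked batch.

    Assumption: this is what "vectorization" meant in the question.
    The stacking itself is done before the timer starts.
    """
    batch = torch.stack(matrices)
    start = time.perf_counter()
    out = torch.matrix_exp(batch)
    elapsed = time.perf_counter() - start
    return elapsed, out


def compare(n, count, seed=0):
    """Run both variants on `count` random n x n matrices."""
    torch.manual_seed(seed)
    matrices = [torch.rand(n, n) for _ in range(count)]
    t_loop, r_loop = time_loop(matrices)
    t_batch, r_batch = time_batched(matrices)
    return t_loop, t_batch, r_loop, r_batch


if __name__ == "__main__":
    print("threads available to PyTorch:", torch.get_num_threads())
    # Many small matrices: per-call overhead is a larger share of the work.
    # A few large matrices: each call already has plenty to compute.
    for n, count in [(8, 2000), (1000, 4)]:
        t_loop, t_batch, _, _ = compare(n, count)
        print(f"n={n}, count={count}: loop {t_loop:.3f} s, batched {t_batch:.3f} s")
```

This is a single-shot wall-clock measurement, so the numbers will be noisy; for anything serious, %timeit or torch.utils.benchmark is a better choice. If the argument above holds, I would expect the gap between the two variants to be more visible for the small matrices than for the 1000x1000 ones, but I have not measured it.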

Best regards