Multithreaded Matrix Multiplication speed

My code performs many batched matrix multiplications of small matrices, but I get no speedup from using more than one thread on a 4-core machine (also tested on a 6-core machine and a 64-core cluster node).
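For context, here is a minimal sketch of the kind of batched operation I mean (the batch size and shapes are illustrative, not my real data):

```python
import torch

# A large batch of tiny 2x2 matrices.
n = 100000
a = torch.randn(n, 2, 2)
b = torch.randn(n, 2, 2)

# Batched matrix multiplication over the leading batch dimension.
c = torch.bmm(a, b)  # equivalent to a @ b for 3-D tensors
print(c.shape)  # torch.Size([100000, 2, 2])
```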

So I ran a test (on the 4-core machine only) and found the following:

```python
import timeit

print(timeit.timeit(
    "a*b",
    setup="import torch;"
          " torch.set_num_threads(1); torch.set_num_interop_threads(1);"
          " a=torch.normal(mean=torch.zeros((1000000,2,2)),std=1.0);"
          " b=torch.normal(mean=torch.zeros((1000000,2,2)),std=1.0)",
    number=10000))
```

Output: `75.0120605`

CPU usage was at roughly 23% for the Python process.

Then I ran the same benchmark with four threads:

```python
import timeit

print(timeit.timeit(
    "a*b",
    setup="import torch;"
          " torch.set_num_threads(4); torch.set_num_interop_threads(4);"
          " a=torch.normal(mean=torch.zeros((1000000,2,2)),std=1.0);"
          " b=torch.normal(mean=torch.zeros((1000000,2,2)),std=1.0)",
    number=10000))
```

Output: `61.3748576`

CPU usage was above 85% for the Python process and at 100% for the whole system.

I reproduced this with similar results several times, and I also tried different numbers of threads and interop threads without any improvement.
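To rule out the settings simply being ignored, the intra-op thread count can be checked directly (a quick sanity check, assuming `set_num_threads` is called before any parallel work):

```python
import torch

torch.set_num_threads(4)
# Intra-op thread count as torch itself reports it.
print(torch.get_num_threads())  # 4
```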

While this is at least some speedup (which I do not get for my own code), it is far less than what I expected from using 4 cores instead of one.

Is this expected?

I need my code to scale so that it can use the 64 cores of the cluster node efficiently.
There are some loops that I could parallelize manually, but I would prefer an efficient way to run the batched matrix multiplications themselves in parallel.
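One manual approach I considered is chunking the batch and multiplying the chunks in a thread pool; as far as I understand, torch releases the GIL inside its C++ kernels, so the threads can actually overlap. A sketch (function name and worker count are my own, and I have not measured whether it is faster):

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def parallel_bmm(a, b, workers=4):
    """Batched matmul with the batch split across a thread pool."""
    # One chunk of the batch per worker.
    a_chunks = torch.chunk(a, workers)
    b_chunks = torch.chunk(b, workers)
    # Each torch.bmm call runs in its own thread; torch releases the
    # GIL inside the kernel, so the calls can execute concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(torch.bmm, a_chunks, b_chunks))
    return torch.cat(parts)

a = torch.randn(1000, 2, 2)
b = torch.randn(1000, 2, 2)
out = parallel_bmm(a, b)
# Sanity check against the single-call version.
print(torch.allclose(out, torch.bmm(a, b)))
```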