My code uses a lot of batched matrix multiplications of small matrices, but I get no speedup from using more than one thread on a 4-core machine (also tested on a 6-core machine and a 64-core cluster node).
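For context, the pattern in my code is essentially this (a minimal sketch; the sizes are illustrative, not my real data):

```python
import torch

# A batch of one million small (2x2) matrices.
a = torch.randn(1000000, 2, 2)
b = torch.randn(1000000, 2, 2)

# Batched matrix multiplication: one 2x2 matmul per batch entry.
# Equivalent to a @ b for these shapes.
c = torch.bmm(a, b)
print(c.shape)  # torch.Size([1000000, 2, 2])
```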
So I ran a test and found the following (only on the 4-core machine):
>import timeit
>setup = """
>import torch
>torch.set_num_threads(1)
>torch.set_num_interop_threads(1)
>a = torch.normal(mean=torch.zeros((1000000, 2, 2)), std=1.0)
>b = torch.normal(mean=torch.zeros((1000000, 2, 2)), std=1.0)
>"""
>print(timeit.timeit("a*b", setup=setup, number=10000))
75.0120605
CPU usage was at roughly 23% for Python.
Then I ran the following:
>import timeit
>setup = """
>import torch
>torch.set_num_threads(4)
>torch.set_num_interop_threads(4)
>a = torch.normal(mean=torch.zeros((1000000, 2, 2)), std=1.0)
>b = torch.normal(mean=torch.zeros((1000000, 2, 2)), std=1.0)
>"""
>print(timeit.timeit("a*b", setup=setup, number=10000))
61.3748576
CPU usage was above 85% for Python and at 100% for the whole system.
I reproduced this with similar results a couple of times, and I also tried different numbers of threads and interop threads without any improvement.
While this is at least some speedup (which I do not get in my own code), it is not what I expected from using 4 cores instead of one.
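The sweep over thread counts looked roughly like this (a sketch; it only varies the intra-op threads, since as far as I can tell `torch.set_num_interop_threads` can only be set once per process, so sweeping that would need a fresh process each time):

```python
import timeit
import torch

a = torch.randn(1000000, 2, 2)
b = torch.randn(1000000, 2, 2)

times = {}
for n in (1, 2, 4):
    torch.set_num_threads(n)  # size of the intra-op thread pool
    times[n] = timeit.timeit(lambda: a * b, number=100)
    print(f"{n} intra-op threads: {times[n]:.3f} s")
```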
Is this expected?
I need my code to scale so that it can use the 64 cores of the cluster node efficiently.
There are some loops that I could manually parallelize, but I would prefer an efficient way to just run the batched matrix multiplications in parallel.
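For illustration, the manual parallelization I have in mind is something like the following sketch (assuming torch ops release the GIL, so plain Python threads can run the chunks concurrently; `batched_mm_parallel` is a made-up name):

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def batched_mm_parallel(a, b, num_workers=4):
    # Keep each worker's bmm single-threaded to avoid oversubscription.
    torch.set_num_threads(1)
    # Split the batch dimension into one chunk per worker.
    chunks_a = torch.chunk(a, num_workers)
    chunks_b = torch.chunk(b, num_workers)
    # torch ops release the GIL, so the chunks can run in parallel threads.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(torch.bmm, chunks_a, chunks_b))
    return torch.cat(results)

a = torch.randn(1000000, 2, 2)
b = torch.randn(1000000, 2, 2)
c = batched_mm_parallel(a, b)
```

But this is exactly the kind of boilerplate I was hoping torch would handle for me via its own thread settings.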