How much of a performance difference is there between using ATen and MKL?

Today I compiled the source code without the MKL library, but with USE_OPENMP=1. However, when I run the following code:

import timeit
import torch

runtimes = []
threads = [1] + list(range(2, 49, 2))  # 1, 2, 4, ..., 48
for t in threads:
    torch.set_num_threads(t)  # thread count used for this run
    # Time 100 multiplications of two 1024x1024 matrices.
    r = timeit.timeit(
        setup="import torch; x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)",
        stmt="torch.mm(x, y)",
        number=100,
    )
    runtimes.append(r)
# ... plotting (threads, runtimes) ...

I found that only about one core on my machine is doing any work. The machine has 24 cores, but when I set the thread number to 12, one core is at 100% load and the others are at about 2% to 10%.
However, when I compile the source code with MKL, the situation is different: the threads parallelize well and many cores are at 100% load.
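
To put numbers on the difference between the two builds, the runtimes list from the loop above can be turned into speedup and parallel efficiency relative to the single-thread run. This is only a minimal sketch: the helper name `scaling_table` is made up for illustration, and the sample runtimes in the usage lines are hypothetical values, not measurements from my machine.

```python
def scaling_table(threads, runtimes):
    """Return (threads, speedup, efficiency) rows relative to the first entry.

    Assumes the first entry is the single-thread run, as in the loop above.
    """
    base = runtimes[0]
    rows = []
    for t, r in zip(threads, runtimes):
        speedup = base / r        # how many times faster than the 1-thread run
        efficiency = speedup / t  # 1.0 would mean perfect scaling
        rows.append((t, speedup, efficiency))
    return rows


# Hypothetical runtimes in seconds, for illustration only.
for t, s, e in scaling_table([1, 2, 4], [8.0, 4.0, 4.0]):
    print(f"{t:2d} threads: speedup {s:.2f}x, efficiency {e:.2f}")
```

If only one core is really doing the work, I would expect the speedup column to stay close to 1 no matter how many threads are requested.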

Can I get thread parallelism from ATen without MKL?
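
In case it helps with diagnosing this, here is a rough sketch of how the build could be asked what it was compiled with. It assumes a PyTorch version recent enough to provide `torch.backends.openmp`, `torch.backends.mkl` and `torch.__config__.parallel_info()`; older builds may not have all of them. The helper name `build_summary` is my own.

```python
import torch


def build_summary():
    """Collect the threading-related settings this build reports."""
    return {
        "num_threads": torch.get_num_threads(),            # current thread count
        "mkl": torch.backends.mkl.is_available(),          # built with MKL?
        "openmp": torch.backends.openmp.is_available(),    # built with OpenMP?
    }


print(build_summary())
# Longer text report on the threading backend (assumes a recent PyTorch).
print(torch.__config__.parallel_info())
```

If `openmp` comes back False on the build without MKL, that would suggest USE_OPENMP=1 did not take effect at compile time.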