Why is PyTorch using only half of the CPU cores (SMT enabled)?

I’ve searched for the reason, and it seems to be related to simultaneous multithreading (SMT) and OpenMP. OMP_NUM_THREADS seems to default to (number of CPU cores) / 2. Is this behavior intended in PyTorch? I don’t think this will help increase performance…

PyTorch has intra-op and inter-op parallelism. This means that for a given op, you wouldn’t necessarily want to use all threads. If you have an application where you know you don’t need the latter, you can adjust the defaults.
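
For example, a minimal sketch of adjusting both pools; the thread counts here (8 and 4) are placeholders you’d tune for your machine:

```python
import torch

# Threads used *within* a single op (e.g. one large matmul);
# this is what OMP_NUM_THREADS influences at startup.
torch.set_num_threads(8)

# Threads used to run *independent* ops concurrently.
# Note: this must be called before any inter-op parallel work
# has started, otherwise PyTorch raises an error.
torch.set_num_interop_threads(4)

print(torch.get_num_threads())          # 8
print(torch.get_num_interop_threads())  # 4
```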

Best regards

Thomas

Thank you for the answer! So it is intended behavior. By the way, I don’t understand what intra-op and inter-op parallelism are. Could you please explain more?

If you have 4 cores and need to do, say, 8 matrix multiplications (with separate data), you could use all 4 cores for each matrix multiplication in turn (intra-op parallelism). Or you could use a single core for each op and run 4 of them in parallel (inter-op parallelism).
In training, you might also want to reserve some cores for the dataloader; for inference, the JIT can parallelize things (I think).
The configuration is documented here, but without much explanation: https://pytorch.org/docs/stable/torch.html#parallelism
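
To make the distinction concrete, here is a rough sketch of the two modes using `torch.jit.fork`/`torch.jit.wait`, which schedule independent tasks on the inter-op pool; `run_one` and the matrix sizes are made up for illustration:

```python
import torch

def run_one(a, b):
    # One independent unit of work (a single matmul).
    return a @ b

mats = [(torch.randn(512, 512), torch.randn(512, 512)) for _ in range(8)]

# Intra-op parallelism: the matmuls run one after another,
# each free to use all intra-op threads internally.
sequential = [run_one(a, b) for a, b in mats]

# Inter-op parallelism: launch the matmuls as independent tasks
# on the inter-op thread pool, then wait for all of them.
futures = [torch.jit.fork(run_one, a, b) for a, b in mats]
parallel = [torch.jit.wait(f) for f in futures]
```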

Best regards

Thomas


Hi, thanks for the great discussion!

Is the choice between intra-op and inter-op parallelism made by PyTorch internally? For me, I found that setting torch.set_num_threads or torch.set_num_interop_threads to a larger number does not change the run time (or even slows it down), even though more CPU cores were involved in the computation.
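
In case it helps, here is roughly how I measured it; the matrix size and thread counts are arbitrary, and only the intra-op pool is varied (changing the inter-op pool size after parallel work has started isn’t allowed):

```python
import time
import torch

x = torch.randn(2048, 2048)

for n in (1, 2, 4, 8):
    torch.set_num_threads(n)
    x @ x  # warm-up so one-time overhead isn't timed
    start = time.perf_counter()
    for _ in range(10):
        x @ x
    elapsed = time.perf_counter() - start
    print(f"{n} intra-op threads: {elapsed:.3f}s")
```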