I’ve searched for why, and it seems to be related to simultaneous multithreading (SMT) and OpenMP. OMP_NUM_THREADS seems to default to (number of logical CPUs) / 2, i.e. the physical core count. Is this behavior intended in PyTorch? I don’t think that this will increase performance…
PyTorch has intra-op and inter-op parallelism. This means that for a given op, you don’t necessarily want to use all threads. If you have an application where you know you don’t need the latter, you can adjust the defaults.
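Adjusting the defaults looks something like this (the thread counts here are just illustrative, not recommendations):

```python
import torch

# The inter-op pool size can only be set before any inter-op work
# has started, so do it right after import.
torch.set_num_interop_threads(2)

# Intra-op threads: how many threads a single op may use internally.
torch.set_num_threads(2)

print(torch.get_num_threads())          # 2
print(torch.get_num_interop_threads())  # 2
```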
Thank you for the answer! So it is intended behavior. By the way, I don’t understand what intra-op and inter-op parallelism are. Could you please explain more?
If you have 4 cores and need to do, say, 8 matrix multiplications (with separate data), you could use all 4 cores for each matrix multiplication, one after another (intra-op parallelism). Or you could use a single core per op and run 4 of them in parallel (inter-op parallelism).
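A small sketch of the two flavors (matrix sizes and thread counts are made up):

```python
import torch

# Intra-op: a single matmul is split across the intra-op thread pool.
torch.set_num_threads(4)
a, b = torch.randn(256, 256), torch.randn(256, 256)
c = a @ b  # one op, up to 4 threads working on it internally

# Inter-op: independent ops run concurrently as futures.
mats = [torch.randn(64, 64) for _ in range(8)]
futs = [torch.jit.fork(torch.mm, m, m) for m in mats]
outs = [torch.jit.wait(f) for f in futs]
```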
In training, you might also want to reserve some cores for the dataloader; for inference, the JIT can parallelize things (I think).
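For the dataloader part, that’s the usual num_workers knob (the dataset and values here are just an example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))

# num_workers=2 loads batches in 2 worker processes, leaving the
# main process's threads free for the model's compute.
loader = DataLoader(ds, batch_size=8, num_workers=2)

for xb, yb in loader:
    pass  # 4 batches, each xb of shape (8, 4)
```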
The configuration is documented here, but without much explanation: https://pytorch.org/docs/stable/torch.html#parallelism