When I run my model on the CPU, it occupies all CPU cores by default. But when I export OMP_NUM_THREADS=1, it takes almost the same time for the same input. So I wonder: why does using all CPU cores give no improvement over a single thread?
I have tried installing from both source and binary, but there is no change. The OS is CentOS Linux release 126.96.36.199.
What’s your PyTorch version? Also, did you try running with, e.g., 4 OMP threads? I think the problem appears because when you’re using all the cores, they’re competing for cache space and can’t proceed as effectively.
My PyTorch version is 0.1.10. As you suggested, setting OMP_NUM_THREADS=4 works well. Thanks for your reply.
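For anyone hitting the same issue, a quick way to compare thread counts is to time the same run under different OMP_NUM_THREADS values from the shell. This is just a sketch: the inline Python one-liner is a placeholder workload (an assumption), so substitute your own inference script.

```shell
# Time the same workload under a few OMP thread counts.
# The inline Python below is only a stand-in workload; replace it
# with your actual model script, e.g. `python3 run_model.py`.
for n in 1 4; do
  echo "OMP_NUM_THREADS=$n"
  OMP_NUM_THREADS=$n python3 -c "import time; t = time.time(); sum(i*i for i in range(10**6)); print('%.3f s' % (time.time() - t))"
done
```

Setting the variable per-invocation (rather than exporting it globally) keeps the comparison clean, since each run sees exactly one value.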