I noticed that PyTorch on CPU is slower when I set the number of threads to more than 1 with the following line of code:
torch.set_num_threads(30)
I was wondering if anyone has had the same issue?
@Rojin how many cores does your CPU have?
To estimate the maximum number of threads your machine can run simultaneously, use this formula (it assumes 2 hardware threads per core, i.e. hyper-threading is enabled):
Number of threads = number of cores * 2
The reason could be that you are using more threads (30 in your case) than your machine can actually run in parallel.
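Run
nproc
which prints the number of processing units available to the process, i.e. Socket(s) * Core(s) per socket * Thread(s) per core when no CPU restrictions apply.
Pass that value to torch.set_num_threads().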
You can also use torch.get_num_threads() to get the number of threads which will be used for parallelizing.
P.S: Not sure how to interpret the output given by torch.get_num_threads().
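For reference, here is a minimal sketch of how the same count could be read from Python instead of nproc. It uses only the standard library. The helper name usable_cpus is made up for illustration, and os.sched_getaffinity only exists on some platforms (Linux among them), so the code falls back to os.cpu_count() elsewhere.

```python
import os


def usable_cpus():
    """Logical CPUs this process may run on (hypothetical helper name)."""
    if hasattr(os, "sched_getaffinity"):
        # Honors CPU pinning (e.g. taskset) on platforms that support it.
        return len(os.sched_getaffinity(0))
    # Fallback: every logical CPU in the machine; cpu_count() may return None.
    return os.cpu_count() or 1


print("logical CPUs usable by this process:", usable_cpus())
```

The printed value could then be passed to torch.set_num_threads(), and torch.get_num_threads() read back afterwards to confirm the setting.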
Thanks for your responses. I have 56 cores, which means 112 threads. I set the number of threads to 112, and it definitely slowed things down compared to using only 1 thread.
I am facing the same problem. Did you find out the reason behind this?
First, confirm that the threads are actually running on different CPUs.
Then pay attention to memory latency if your machine is NUMA-based.
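One rough way to check whether a Linux machine is NUMA-based is to count the node directories that the kernel exposes in sysfs. This is only a sketch: the helper name numa_node_count is invented here, and it assumes the usual /sys/devices/system/node layout, returning None when that directory does not exist (for example on non-Linux systems).

```python
import os
import re


def numa_node_count(sysfs_root="/sys/devices/system/node"):
    """Count node<N> entries under sysfs_root; None if the directory is absent."""
    if not os.path.isdir(sysfs_root):
        return None
    # Entries such as node0, node1 are NUMA nodes; other files are ignored.
    return sum(1 for name in os.listdir(sysfs_root)
               if re.fullmatch(r"node\d+", name))


count = numa_node_count()
if count is None:
    print("NUMA nodes: unknown (no sysfs node directory)")
else:
    print("NUMA nodes:", count)
```

A result greater than 1 suggests that threads on different nodes may see different memory latency.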
The "Multiprocessing best practices" page of the PyTorch 2.6 documentation mentions torch.set_num_threads(floor(N/M)).
If I understand the article right,
N = os.cpu_count() # Number of vCPUs available on the machine
M = args.num_processes # Number of processes to use, passed as argument
Is it better to have a small M (fewer processes, but more threads per process) or a big M (more processes, but fewer threads per process)?
How does the number of processes or the number of threads per process affect data loading and the subsequent augmentation/computation speed?
If computer has:
Thread(s) per core: 2
Core(s) per socket: 12
Should M value be 12? (Determined by Core(s) per socket)
Is M = 11 wasting 2 threads per core?
I assume the 2 threads of the 12th core cannot be reassigned to the other 11 cores, so they are wasted.
Or is this assumption debunked by this “thread-migration” concept mentioned in https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/os-thread-migration.html ?
Is M = 13 impossible? (If the code doesn’t break, what is happening under the hood?)
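To put numbers on these questions, here is a small sketch of the floor(N/M) rule applied to the example machine. It assumes a single socket, so N = 12 cores * 2 threads per core = 24 logical CPUs; the socket count is not given above. The helper name and the lower bound of 1 thread per process are also assumptions made for illustration. The code only does the arithmetic and does not call PyTorch.

```python
def threads_per_process(n_cpus, num_processes):
    """floor(N / M), clamped so every process gets at least 1 thread."""
    return max(1, n_cpus // num_processes)


# Assumption: one socket with 12 cores and 2 threads per core.
N = 1 * 12 * 2

for M in (11, 12, 13, 30):
    t = threads_per_process(N, M)
    total = t * M
    if total > N:
        status = "more threads than logical CPUs"
    else:
        status = f"{N - total} logical CPU(s) without a dedicated thread"
    print(f"M={M}: {t} thread(s) per process, {total} in total, {status}")
```

Under that assumption, M = 11 and M = 12 both give 2 threads per process, while M = 13 drops to 1 thread per process. Nothing in this arithmetic pins a thread to a core; unless affinity is set explicitly, the OS scheduler is free to move threads between logical CPUs.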