We are doing distributed PyTorch training with DistributedDataParallel (DDP) on CPU clusters.
I wonder whether it is a good idea to set OMP_NUM_THREADS as suggested by the tuning guide.
My worry is that the training is already a CPU-intensive workload, and the OpenMP threads may contend for the CPU, which could lead to even worse performance.
Another question: if OMP_NUM_THREADS should be set, is there a suggested value for it relative to the number of logical cores in use?
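
For concreteness, this is the kind of setting I have in mind. It is only a sketch: it assumes a torchrun-style launcher that exports LOCAL_WORLD_SIZE, and the even split of logical cores across ranks is just my guess at a starting point, not something I have confirmed from the guide:

```python
import os

# Hypothetical heuristic: give each DDP rank on a node an equal share of the
# node's logical cores, so the per-rank OpenMP pools don't oversubscribe the CPU.
cores_per_node = os.cpu_count() or 1                            # logical cores on this node
ranks_per_node = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))   # set by torchrun-style launchers

# e.g. 64 logical cores / 8 ranks per node -> OMP_NUM_THREADS=8 for each process
os.environ.setdefault("OMP_NUM_THREADS", str(max(1, cores_per_node // ranks_per_node)))

# Import torch only after the env var is set, so its intra-op thread pool picks it up.
import torch

print(torch.get_num_threads())  # typically reflects OMP_NUM_THREADS
```

Would something along these lines be reasonable, or is a different ratio (e.g. physical cores instead of logical, or fewer threads per rank) usually recommended?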