Is it a good idea to use OpenMP when doing CPU PyTorch distributed training?

We are doing distributed PyTorch training using DDP on a CPU cluster.

I wonder whether it is a good idea to set OMP_NUM_THREADS as suggested by the tuning guide.

My worry is that training is already a CPU-intensive workload, and the use of OpenMP may cause contention for CPU cores, which could lead to even worse performance.

Another question: if OMP_NUM_THREADS should be set, is there a suggested value related to the number of logical cores used?
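For context, here is roughly how we would set it per worker. The divide-physical-cores-by-workers heuristic and the `NPROC_PER_NODE` variable are just our own illustration, not something taken from the tuning guide:

```python
import os

# Illustrative heuristic: give each DDP worker an equal slice of the
# node's physical cores. The exact split is an assumption, not an
# official recommendation.
workers_per_node = int(os.environ.get("NPROC_PER_NODE", "4"))  # hypothetical env var
physical_cores = os.cpu_count() // 2  # assumes 2 hardware threads per core
os.environ.setdefault("OMP_NUM_THREADS", str(max(1, physical_cores // workers_per_node)))

import torch  # imported after setting OMP_NUM_THREADS so the value takes effect

# Keep PyTorch's intra-op thread pool consistent with the OpenMP setting.
torch.set_num_threads(int(os.environ["OMP_NUM_THREADS"]))
```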

Which backend are you using? If you are using Gloo, this variable probably won't help much. If you are using MPI, it can have some effect.
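For reference, the backend is just the string passed when the process group is initialized; a minimal CPU-only sketch (the master address/port and rank/world-size values are placeholders your launcher would normally provide):

```python
import os
import torch.distributed as dist

# Placeholders; in a real cluster these come from your launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# "gloo" is the usual choice for CPU training; "mpi" requires a PyTorch
# build with MPI support.
dist.init_process_group(
    backend="gloo",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
```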

Sorry for the late response.
We are using the Gloo backend.
Currently, we set this value to the number of physical cores used for training.
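Concretely, something like this sketch (counting physical cores with psutil is just our own approach, not from the tuning guide):

```python
import os
import psutil  # used here only to count physical cores

# Our current setting: OMP_NUM_THREADS equals the number of physical
# cores available to the training job on this node.
physical_cores = psutil.cpu_count(logical=False)
os.environ["OMP_NUM_THREADS"] = str(physical_cores)
```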
Why does this variable not help for Gloo?