Why are all threads bound to the same core?

I want to use the PyTorch DDP module for distributed training, with OpenBLAS as the BLAS backend. When I run the following benchmark

import timeit
runtimes = []
threads = [1] + [t for t in range(2, 49, 2)]
for t in threads:
    # set the number of intra-op threads before timing each run
    r = timeit.timeit(setup=f"import torch; torch.set_num_threads({t}); x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)", stmt="torch.mm(x, y)", number=100)
    runtimes.append(r)
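To double-check that a process is even allowed to use more than one core, one can inspect its CPU affinity mask. A minimal sketch (`os.sched_getaffinity` is Linux-only):

```python
import os

# Linux-only: the set of logical CPUs this process may be scheduled on.
# A full mask means threads are free to spread over the machine;
# a single-element set means everything is pinned to one core.
affinity = os.sched_getaffinity(0)  # 0 = the current process
print(f"process may run on {len(affinity)} core(s): {sorted(affinity)}")
```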

I found that different threads were running on different cores.
However when I execute my training script, I found all threads are bound to the same core.
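One way to observe this from the shell is the `PSR` column of `ps`, which shows the processor each thread last ran on. A sketch (here it inspects the current shell via `$$`; substitute the PID of the training process):

```shell
# List each thread (LWP) of a process together with the core (PSR)
# it last ran on. If every row shows the same PSR, the threads are
# all landing on one core.
ps -L -o lwp,psr,comm -p $$
```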

export NUM_CORES=64
COMMAND="$HOME/deepnet_mpi/CosmoFlow.py --epochs=120 --backend=gloo --workers=0 --batch-size=1 --print-freq=50 --data=$HOME/Nbody/datasets/v6"

python3 -m torch.distributed.launch \
--nproc_per_node=$NPROC_PER_NODE \

What is the reason for this problem? Here is my environment:

Collecting environment information...
PyTorch version: 1.6.0a0+b31f58d
Is debug build: No
CUDA used to build PyTorch: None

OS: CentOS Linux release 7.6.1810 (AltArch)
GCC version: (GCC) 9.2.0
CMake version: version 3.16.5

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0a0+b31f58d
[conda] Could not collect

I also found that when multiple processes are launched, they all run on the same CPU core.

hmm, I am not aware of any DDP code that would change the threading behavior.

cc @VitalyFedyunin do you know what might lead to this behavior?

Thanks for your reply; I have now solved the problem. The cause was that I had not set an OpenMP environment variable:

export GOMP_CPU_AFFINITY=0-127
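For reference, `GOMP_CPU_AFFINITY` tells the GNU OpenMP runtime (libgomp, used by OpenBLAS and PyTorch when built with GCC) which CPUs its threads may bind to, so `0-127` lets them spread over 128 logical CPUs. A hedged sketch of the exports (the core range must match your machine; the `OMP_*` lines are a portable alternative, not part of the original fix):

```shell
# Bind GNU OpenMP threads across all 128 logical CPUs
# (adjust the range to your node, e.g. 0-63 on a 64-core machine).
export GOMP_CPU_AFFINITY=0-127

# Runtime-agnostic alternative honored by any OpenMP implementation:
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
```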