I want to use the PyTorch DDP module for distributed training, with OpenBLAS as the BLAS library. When I execute the following benchmark:
import timeit
import torch

runtimes = []
threads = [1] + [t for t in range(2, 49, 2)]
for t in threads:
    torch.set_num_threads(t)
    r = timeit.timeit(setup="import torch; x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)",
                      stmt="torch.mm(x, y)", number=100)
    runtimes.append(r)
I found that the different threads ran on different cores.
However, when I execute my training script, all threads are bound to the same core:
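One way to tell whether the process (and hence all of its OpenMP/OpenBLAS threads) has been restricted to a single core is to look at its CPU affinity mask. This is a diagnostic sketch assuming Linux; `os.sched_getaffinity` is not available on all platforms:

```python
import os

# Query the set of CPUs the current process is allowed to run on.
# If this set contains only one CPU, every thread spawned by this
# process is pinned to that single core, regardless of
# OMP_NUM_THREADS or torch.set_num_threads().
allowed = os.sched_getaffinity(0)  # 0 = the calling process
print(f"process may run on {len(allowed)} CPU(s): {sorted(allowed)}")
```

If the printed set is a single core, the restriction was most likely inherited from the launcher (e.g. an MPI runner or batch scheduler) rather than set by PyTorch itself.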
export GLOO_SOCKET_IFNAME=ib0
export NUM_CORES=64
export OMP_NUM_THREADS=$NUM_CORES
NPROC_PER_NODE=1
COMMAND="$HOME/deepnet_mpi/CosmoFlow.py --epochs=120 --backend=gloo --workers=0 --batch-size=1 --print-freq=50 --data=$HOME/Nbody/datasets/v6"
python3 -m torch.distributed.launch \
    --nproc_per_node=$NPROC_PER_NODE \
    $COMMAND
What is the reason for this problem? This is my environment:
Collecting environment information...
PyTorch version: 1.6.0a0+b31f58d
Is debug build: No
CUDA used to build PyTorch: None
OS: CentOS Linux release 7.6.1810 (AltArch)
GCC version: (GCC) 9.2.0
CMake version: version 3.16.5
Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0a0+b31f58d
[conda] Could not collect
I also found that when multiple processes are launched, the different processes end up on the same CPU core as well.
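As a workaround sketch (not a fix for the underlying cause), each worker process can be given its own disjoint slice of the cores allowed to it. The environment variables `LOCAL_RANK` and `NPROC_PER_NODE` used below are assumptions for illustration; `torch.distributed.launch` in this PyTorch version actually passes the rank via a `--local_rank` argument:

```python
import os

# Hypothetical rank/world-size sources; adapt to how your launcher
# exposes them (e.g. the --local_rank argument of
# torch.distributed.launch).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
nproc = int(os.environ.get("NPROC_PER_NODE", 1))

# Partition the CPUs this process is currently allowed to use into
# nproc equal contiguous slices and pin this process to slice
# number local_rank.
avail = sorted(os.sched_getaffinity(0))
per_proc = len(avail) // nproc
mine = avail[local_rank * per_proc:(local_rank + 1) * per_proc]
os.sched_setaffinity(0, mine)
print(f"rank {local_rank}: pinned to CPUs {mine}")
```

Run early in the training script, before worker threads are created, so that OpenBLAS and OpenMP threads inherit the widened (or partitioned) mask.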