Why does PyTorch only find one physical CPU?

Problem description:

I compiled the PyTorch source code on an ARM machine, and I want to use the DDP interface for
distributed training. However, I found that PyTorch could only find one physical CPU, which means that my CPU usage cannot exceed 50%. (The machine has two sockets.)

My machine contains two physical CPUs (sockets), each with 64 cores.

I use OpenBLAS as the BLAS library and compiled it with OpenMP. In the script, I set the following environment variables:

export OMP_NUM_THREADS=128
export GOMP_CPU_AFFINITY=0-127
export OMP_DISPLAY_ENV=true

Then I execute my script:

python3 -m torch.distributed.launch \
--nproc_per_node=$NPROC_PER_NODE \
script.py 2>&1

I found that all the threads were running on the same physical CPU, and the script output:

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '64'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'TRUE'
  OMP_PLACES = '{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31},{32},{33},{34},{35},{36},{37},{38},{39},{40},{41},{42},{43},{44},{45},{46},{47},{48},{49},{50},{51},{52},{53},{54},{55},{56},{57},{58},{59},{60},{61},{62},{63},{64},{65},{66},{67},{68},{69},{70},{71},{72},{73},{74},{75},{76},{77},{78},{79},{80},{81},{82},{83},{84},{85},{86},{87},{88},{89},{90},{91},{92},{93},{94},{95},{96},{97},{98},{99},{100},{101},{102},{103},{104},{105},{106},{107},{108},{109},{110},{111},{112},{113},{114},{115},{116},{117},{118},{119},{120},{121},{122},{123},{124},{125},{126},{127}'
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '64'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'TRUE'
  OMP_PLACES = '{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31},{32},{33},{34},{35},{36},{37},{38},{39},{40},{41},{42},{43},{44},{45},{46},{47},{48},{49},{50},{51},{52},{53},{54},{55},{56},{57},{58},{59},{60},{61},{62},{63}'
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END

Could someone tell me what the reason is? Thanks!

Environment:

Collecting environment information...
PyTorch version: 1.6.0a0+b31f58d
Is debug build: No
CUDA used to build PyTorch: None

OS: CentOS Linux release 7.6.1810 (AltArch)
GCC version: (GCC) 9.2.0
CMake version: version 3.16.5

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0a0+b31f58d
[conda] Could not collect

From the DDP Docs, you must do the following when initializing DDP:

For multi-device modules and CPU modules, device_ids must be None or an empty list, and input data for the forward pass must be placed on the correct device.

While you cannot specify which cores to run each process on from PyTorch, you should still be able to specify CPU affinity in general.
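
For example, here is a minimal sketch (not from the thread) of pinning each spawned worker to one socket with os.sched_setaffinity. The 64-cores-per-socket split and the LOCAL_RANK environment variable are assumptions here (older versions of torch.distributed.launch pass --local_rank as a script argument instead of setting the variable):

import os

# Sketch: pin each DDP worker to one 64-core socket.
# Assumes two sockets covering cores 0-63 and 64-127, and that the
# launcher exposes the worker index via the LOCAL_RANK env var.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
cores_per_socket = 64                     # assumption for this machine
socket_id = local_rank % 2                # two sockets
first_core = socket_id * cores_per_socket
affinity = set(range(first_core, first_core + cores_per_socket))

os.sched_setaffinity(0, affinity)         # 0 = the calling process
print("Worker", local_rank, "pinned to CPUs", sorted(affinity))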

Thanks for your reply!
Yeah, I did what they said:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[], output_device=[])

I also think I should be able to specify CPU affinity; however, I failed.

@VitalyFedyunin @ptrblck is it possible to specify CPU affinity when using PyTorch?

It should be possible to set the CPU affinity using NVML and Tesla GPUs for DDP.
You could probably use pynvml as a convenient Python API to create the affinity list and set it via os.sched_setaffinity.
However, I haven’t played around with it a lot.
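
A rough sketch of that idea, assuming an NVIDIA GPU is present and the pynvml package is installed (which does not apply to the CPU-only machine in this thread):

import math
import os
import pynvml

# Sketch: read the CPU affinity NVML recommends for GPU 0 and apply it
# to the current process.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML returns the affinity as an array of 64-bit mask words.
num_words = math.ceil(os.cpu_count() / 64)
mask_words = pynvml.nvmlDeviceGetCpuAffinity(handle, num_words)
cpus = {cpu for cpu in range(os.cpu_count())
        if (mask_words[cpu // 64] >> (cpu % 64)) & 1}

os.sched_setaffinity(0, cpus)
pynvml.nvmlShutdown()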

Thanks for your reply. I have tried it, but it did nothing. I implemented it like this:

import os

# Set the CPU affinity of the current process (pid 0 = calling process)
pid = 0
affinity_mask = set(range(128))
os.sched_setaffinity(pid, affinity_mask)
print("Number of CPUs:", os.cpu_count())

# Read the mask back and report which CPUs the process may run on
affinity = os.sched_getaffinity(pid)
real_pid = os.getpid()
print("Now, process {} is eligible to run on: {}".format(real_pid, affinity))

In fact, it did print that the process is eligible to run on CPUs 0-127:

Now, process 34094 is eligible to run on: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127}

However, when I use ps -eLF to check the threads of this process, it still uses only half of the CPU cores.
I set the environment variable export OMP_DISPLAY_ENV=true, and the script then prints the OpenMP environment dump twice. Do you know which OpenMP calls produce these two dumps?
As shown above, the first time OMP_PLACES is 0-127, but the second time it becomes 0-63.

If I do not use DDP and just execute the Python script directly, there is only one OpenMP dump.
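
One way to check whether those two dumps come from two different OpenMP runtimes loaded into the same process (for example, the system libgomp plus a copy pulled in by OpenBLAS or PyTorch) is to look at the process's memory maps. A quick Linux-only sketch:

import os

# Sketch (Linux only): list every OpenMP runtime mapped into this process.
# More than one distinct library here could explain two
# "OPENMP DISPLAY ENVIRONMENT" dumps with different OMP_PLACES.
with open("/proc/self/maps") as maps:
    omp_libs = {line.split()[-1] for line in maps
                if "gomp" in line or "libomp" in line or "libiomp" in line}
for lib in sorted(omp_libs):
    print(lib)

Running this after import torch (and after a first tensor operation, so the runtimes are actually initialized) in both the plain and the DDP-launched script would show whether an extra runtime is being loaded.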

@khalil Could you describe your DDP setup? Are you running DDP on a single machine with multiple processes, and if so, how many processes per host? Or are you running DDP across multiple machines?

I am sorry for taking so long to reply. I found that the problem is not caused by DDP; it is caused by the __init__.py file in the torch directory.

When I avoid loading this __init__.py, the problem disappears. I think there is some library problem on my machine. Thanks for your reply.
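
To confirm that the torch import itself is what shrinks the allowed-CPU set, a quick check like the following sketch could be used (whether the mask actually changes on a given machine is an assumption, not something verified in this thread):

import os

# Sketch: compare the allowed-CPU set before and after importing torch.
# If torch/__init__.py (or a library it loads) changes the affinity mask,
# the second count will be smaller than the first.
before = os.sched_getaffinity(0)
import torch  # imported late on purpose for this test
after = os.sched_getaffinity(0)

print("CPUs before import torch:", len(before))
print("CPUs after  import torch:", len(after))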