How to use multiple CPUs or multiple CPU cores to train


As we know, we can use the CUDA_VISIBLE_DEVICES env variable and torch.distributed to run training on multiple GPUs.
Is there any way to use multiple CPUs or multiple CPU cores to run parallel training?

Multiple CPU cores can be used by backend libraries such as MKL and OpenMP; the thread counts can be set via environment variables (e.g. OMP_NUM_THREADS, MKL_NUM_THREADS) or via:

torch.set_num_interop_threads() # Inter-op parallelism
torch.set_num_threads() # Intra-op parallelism
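A minimal sketch of the in-code route (the thread counts here are illustrative, not recommended values):

```python
import torch

# Inter-op: threads used to run independent ops in parallel.
# Must be called before any parallel work has started.
torch.set_num_interop_threads(2)

# Intra-op: threads used inside a single op (e.g. one large matmul).
torch.set_num_threads(2)

print(torch.get_num_threads())          # -> 2
print(torch.get_num_interop_threads())  # -> 2
```

Note that `set_num_interop_threads()` raises an error if it is called after inter-op parallel work has already started, so set it early in the script.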

@ptrblck ,

When we train a model on multiple GPUs, we usually use a command like:
CUDA_VISIBLE_DEVICES=0,1,2,3 WORLD_SIZE=4 python -m torch.distributed.launch --nproc_per_node=4 train.py --bs 16

If we use the above command together with the corresponding code, we can run parallel training on multiple GPUs. My question is: is there a similar method to run training on CPUs, like on GPUs?

Based on this doc it seems gloo might be the right choice, but I’m not deeply familiar with it:

Rule of thumb

  • Use the NCCL backend for distributed GPU training.
  • Use the Gloo backend for distributed CPU training.

Thanks for your question, I guess @jiayisuse might also want to comment on this.

We need to use Gloo for CPU training: call torch.distributed.init_process_group(backend='gloo') in your code.
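A minimal sketch of that setup (a single process with world_size=1 just to show the calls; the address and port are placeholders):

```python
import os
import torch
import torch.distributed as dist

# Rendezvous info for the default env:// init method (placeholder values)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3)
dist.all_reduce(t)  # sums across ranks; a no-op with a single process
print(t)            # -> tensor([1., 1., 1.])

dist.destroy_process_group()
```

In a real multi-process run, each worker calls init_process_group with its own rank and the shared world_size.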

@jiayisuse ,
Thank you!
Thanks @fduwjj , @ptrblck !
Is there any further doc or sample code about how to use Gloo for CPU training?

I think the API looks similar after you have created the process group. Here we have all supported collectives for gloo (and all others): Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation.
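To make that concrete, here is a small single-process sketch (world_size=1, placeholder address/port, and a toy model) that wraps a CPU model in DistributedDataParallel on top of a Gloo process group; a real run would launch one such process per CPU rank:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings for the env:// init method
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# No device_ids argument -> DDP keeps the model on CPU and uses Gloo
# to all-reduce gradients across ranks during backward().
model = DDP(nn.Linear(4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    out = model(torch.randn(8, 4))  # toy batch of random inputs
    loss = out.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

dist.destroy_process_group()
```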