I used DistributedDataParallel with the NCCL backend in my code. I launched training as `python3 -m torch.distributed.launch --nproc_per_node=8 train.py`. However, PyTorch creates a lot of redundant processes on the other GPUs, each showing 0 memory usage, as shown below.
I am using 8x RTX 3090 GPUs with PyTorch 1.7. It ends up creating 8 processes on every single GPU.
It looks like each process is using multiple GPUs. Is this expected? If not, can you try setting the `CUDA_VISIBLE_DEVICES` environment variable appropriately for each process before creating any CUDA context?
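To illustrate the suggestion above, here is a minimal sketch of how each launched process could pin itself to a single GPU. The helper name `pin_gpu_from_local_rank` is hypothetical; it assumes `torch.distributed.launch` passes `--local_rank` to each process (which it does by default in PyTorch 1.7). Note that `CUDA_VISIBLE_DEVICES` must be set before the first CUDA call, otherwise all devices are already visible to the process.

```python
# Sketch (not the original train.py): restrict each launched process to
# its own GPU by setting CUDA_VISIBLE_DEVICES before any CUDA context
# is created. `pin_gpu_from_local_rank` is a hypothetical helper name.
import argparse
import os


def pin_gpu_from_local_rank(argv=None):
    """Parse --local_rank (passed by torch.distributed.launch) and
    expose only that GPU to this process."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    # Must happen before the first CUDA call in this process; after
    # this, the one visible GPU appears as device 0 inside the process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)
    return args.local_rank


if __name__ == "__main__":
    rank = pin_gpu_from_local_rank()
    print(f"local_rank={rank} CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']}")
```

An equivalent fix, if you prefer to keep all GPUs visible, is to call `torch.cuda.set_device(args.local_rank)` at the top of `train.py` and pass `device_ids=[args.local_rank]` to `DistributedDataParallel`, so no process accidentally initializes a context on GPU 0.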