Confused about DistributedDataParallel behavior

I am trying to run my model using DDP on a single node with 3 GPUs. I only intend to use two of them, so I set os.environ["CUDA_VISIBLE_DEVICES"] = "1,2".
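For context, CUDA_VISIBLE_DEVICES renumbers the visible GPUs from zero inside the process, and it must be set before the first CUDA call or it is silently ignored. A minimal sketch of the remapping (the dictionary is just illustrative, not a torch API):

```python
import os

# Must happen before any CUDA initialization (e.g. before the first
# torch.cuda call), otherwise the restriction has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

# Inside this process the visible GPUs are renumbered from zero:
# physical GPU 1 becomes cuda:0 and physical GPU 2 becomes cuda:1.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
remapped = {f"cuda:{i}": f"physical GPU {g}" for i, g in enumerate(visible)}
print(remapped)
```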

I started the code with a single process:
python -m torch.distributed.launch --nproc_per_node=1

The code runs, but when I check nvidia-smi I see processes running on two GPUs.
Process 62966 corresponds to the DDP run (ignore process 2815). I don't understand why the process also shows up on GPU 2 without using any resources. The code seems to run fine in this case.

When I run the command with two processes:
python -m torch.distributed.launch --nproc_per_node=2
two processes are created, but each of them runs on both GPUs. The code does not make progress and gets stuck at torch.distributed.barrier().

My code for init_process_group:

def init_distributed_mode(args):
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    args.distributed = True
    args.dist_backend = 'nccl'
    torch.distributed.init_process_group(backend=args.dist_backend,
                                         init_method=args.dist_url,
                                         world_size=args.world_size,
                                         rank=args.rank)

How do I limit each process to one GPU and keep the code from getting stuck at torch.distributed.barrier()?


Could you try torch.cuda.set_device() instead? torch.cuda.device is only a context manager.
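To make the suggestion concrete, here is a sketch of the per-process setup, assuming LOCAL_RANK is the env var set by torch.distributed.launch for each worker (the is_available() guard is just so the snippet also runs on a CPU-only machine):

```python
import os

import torch

# LOCAL_RANK is set per worker by torch.distributed.launch.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# torch.cuda.set_device pins the *default* CUDA device for this process,
# so every later allocation (model params, NCCL buffers) lands on that GPU.
# `with torch.cuda.device(local_rank):` only changes the device inside the
# `with` block and reverts afterwards, which is why it does not help here.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
```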

Thanks, the code started running and is no longer stuck at the distributed barrier, but the nvidia-smi output still doesn't make sense to me: two processes are running on both GPUs.
Is this the expected behavior, or is something wrong?

How do you initialize DDP? Do you pass the correct device to it, e.g.

ddp_model = DDP(model, device_ids=[rank])

args.gpu = int(os.environ['LOCAL_RANK'])
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
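Putting the two fixes together, a minimal sketch of the setup order that avoids the hang: pin the device first, then initialize the process group, then wrap the model. setup_ddp is a hypothetical helper, and the env-var names follow torch.distributed.launch conventions:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    """Hypothetical helper: pin the GPU before any NCCL work happens."""
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)         # pin default device first
    dist.init_process_group(backend="nccl")   # reads RANK/WORLD_SIZE from env
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])  # one device per process
```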

I just ran a toy example on my machine with 2 processes and got:

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0    882788      C                                            945MiB     |
|    1    882789      C                                            945MiB     |