Confused about DistributedDataParallel behavior

I am trying to run my model with DDP on a single node with 3 GPUs. I only intend to use two of them, so I set os.environ["CUDA_VISIBLE_DEVICES"] = "1,2".

I started by running the code with a single process:
python -m torch.distributed.launch --nproc_per_node=1 train.py

The code runs, but when I check nvidia-smi I see the process appearing on two GPUs.
[nvidia-smi screenshot]
Process 62966 is the DDP process (ignore process 2815). I don't understand why it shows up on GPU 2 while using no resources there. The code seems to run fine in this case.

When I run the command with two processes:
python -m torch.distributed.launch --nproc_per_node=2 train.py
[nvidia-smi screenshot]
I see that two processes are created, but each of them shows up on both GPUs. The code does not run and gets stuck at torch.distributed.barrier().

My code for init_process_group:

def init_distributed_mode(args):
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ['WORLD_SIZE'])
    args.gpu = int(os.environ['LOCAL_RANK'])
    args.distributed = True
    torch.cuda.device(args.gpu)
    args.dist_backend = 'nccl'
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()

How do I limit each process to one GPU and keep the code from getting stuck at torch.distributed.barrier()?

Hi,

Could you try torch.cuda.set_device() instead? torch.cuda.device is a context manager, not a call that sets the current device; see https://github.com/pytorch/pytorch/issues/1608
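The distinction matters because constructing a context manager without a with block is effectively a no-op: its enter/exit logic never runs. A minimal stdlib sketch of the same pitfall (the device/set_device names here are hypothetical stand-ins for the torch.cuda calls, tracking a fake "active device" in a dict):

```python
from contextlib import contextmanager

_state = {"device": 0}  # fake "current device", starts at 0

@contextmanager
def device(idx):
    # Stand-in for torch.cuda.device: switches the active device
    # only for the duration of the with-block, then restores it.
    prev = _state["device"]
    _state["device"] = idx
    try:
        yield
    finally:
        _state["device"] = prev

def set_device(idx):
    # Stand-in for torch.cuda.set_device: a persistent switch.
    _state["device"] = idx

device(1)                        # creating the manager alone changes nothing
assert _state["device"] == 0

with device(1):                  # only inside the block is the device switched
    assert _state["device"] == 1
assert _state["device"] == 0     # restored on exit

set_device(1)                    # persistent switch: what DDP setup needs
assert _state["device"] == 1
```

So torch.cuda.device(args.gpu) on its own leaves the process on the default device (GPU 0 after CUDA_VISIBLE_DEVICES remapping), which is why both processes end up touching the same GPU.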

Thanks, the code now runs and is no longer stuck at the distributed barrier, but the nvidia-smi output still doesn't make sense to me: both processes show up on both GPUs.
[nvidia-smi screenshot]
Is this the expected behavior, or is something wrong?

How do you initialize DDP? Do you pass the correct device to it? e.g.

ddp_model = DDP(model, device_ids=[rank])

args.gpu = int(os.environ['LOCAL_RANK'])
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
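For reference, a minimal end-to-end sketch of the setup discussed above. This runs as a single process on CPU with the gloo backend so it works without GPUs; under torch.distributed.launch with nccl, the structure is the same, with torch.cuda.set_device(local_rank), model.cuda(local_rank), and device_ids=[local_rank]:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch sets these env vars; default them
    # so the script also runs standalone as a single process.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # gloo works on CPU; with GPUs you would use "nccl" and call
    # torch.cuda.set_device(local_rank) before building the model.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    dist.barrier()

    model = torch.nn.Linear(4, 2)
    # On GPU: DDP(model.cuda(local_rank), device_ids=[local_rank])
    ddp_model = DDP(model)

    out = ddp_model(torch.randn(3, 4))
    dist.destroy_process_group()
    return out.shape

if __name__ == "__main__":
    print(main())
```

With --nproc_per_node=2 each launched process gets its own LOCAL_RANK, so pinning the device from it keeps every process on exactly one GPU.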

I just ran a toy example on my machine with 2 processes and got:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    882788      C   ...cal/miniconda3/envs/pytorch3/bin/python   945MiB |
|    1    882789      C   ...cal/miniconda3/envs/pytorch3/bin/python   945MiB |
+-----------------------------------------------------------------------------+