I am trying to run my model with DDP on a single node that has 3 GPUs, although I only intend to use two of them. I started by running the code with only one process:
python -m torch.distributed.launch --nproc_per_node=1 train.py
The code runs, but when I check nvidia-smi I see processes running on two GPUs. Process 62966 corresponds to the DDP run (ignore process 2815). I do not understand why a process shows up on GPU 2 without using any resources. The code seems to run fine in this case.
When I run the command with two processes:
python -m torch.distributed.launch --nproc_per_node=2 train.py
I see that two processes are created, but each process runs on both GPUs. The code does not make progress and gets stuck at torch.distributed.barrier().
My code for init_process_group:

    def init_distributed_mode(args):
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ["WORLD_SIZE"])
        args.gpu = int(os.environ["LOCAL_RANK"])
        args.distributed = True
        torch.cuda.device(args.gpu)
        args.dist_backend = 'nccl'
        torch.distributed.init_process_group(backend=args.dist_backend,
                                             init_method=args.dist_url,
                                             world_size=args.world_size,
                                             rank=args.rank)
        torch.distributed.barrier()
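For comparison, the pattern in PyTorch's own DDP examples calls torch.cuda.set_device(local_rank) before init_process_group, which binds each process to a single GPU (torch.cuda.device only builds a context manager and does nothing unless used in a with statement). A minimal sketch, assuming env:// initialization as set up by torch.distributed.launch; the backend parameter is my addition so the same function can also be exercised on CPU with gloo:

```python
import os
import torch
import torch.distributed as dist

def init_distributed_mode(args, backend="nccl"):
    # torch.distributed.launch exports these per spawned process
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    args.distributed = True
    args.dist_backend = backend
    if backend == "nccl":
        # set_device (not torch.cuda.device) pins this process to one GPU
        # BEFORE init_process_group, so it does not touch cuda:0 as well
        torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend=backend,
                            init_method="env://",
                            world_size=args.world_size,
                            rank=args.rank)
    dist.barrier()
```

With this function, each rank launched by `python -m torch.distributed.launch --nproc_per_node=2 train.py` should only create a CUDA context on its own device.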
How do I limit each process to one GPU and stop the code from getting stuck at torch.distributed.barrier()?