Hi,
I would like to start 2 processes on my computer, which has 2 GPUs. I use the spawn function to start the 2 processes.
Question 1:
how do I specify the rank for each process when I use the spawn function to start main_worker?
Question 2:
how do I specify/check the local_rank of each process inside main_worker?
Question 3:
world_size means the total number of GPUs used across all processes.
rank is the index of each process (the process number).
local_rank means the GPU index used within that process (rank).
Is my understanding correct?
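To check my understanding of question 3, here is a pure-Python sketch of how I believe the three numbers relate in the general multi-node case (node_rank and gpus_per_node are my own names for the sketch, not PyTorch arguments); on a single machine node_rank is 0, so rank and local_rank coincide:

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank = node index * GPUs per node + GPU index on that node."""
    return node_rank * gpus_per_node + local_rank

# Hypothetical layout: 2 nodes with 2 GPUs each.
world_size = 2 * 2  # total number of processes = nodes * GPUs per node
ranks = [global_rank(n, l, 2) for n in range(2) for l in range(2)]
print(world_size, ranks)  # 4 [0, 1, 2, 3]
```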
My test code is pasted as follows.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.optim as optim

def main_worker(proc, nprocs, args):
    # proc is the process index that mp.spawn passes as the first argument;
    # on a single node it serves as both the rank and the local_rank (GPU index).
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=nprocs, rank=proc)
    torch.cuda.set_device(proc)
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=...,
                                               sampler=train_sampler)
    model = ...
    model = model.cuda(proc)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[proc])
    optimizer = optim.SGD(model.parameters(), lr=...)
    for epoch in range(100):
        train_sampler.set_epoch(epoch)  # reshuffle data across processes each epoch
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

if __name__ == '__main__':
    # spawn must be called after main_worker is defined, from the main module
    mp.spawn(main_worker, nprocs=2, args=(2, myargs))