How to use spawn to start sub-processes


I would like to start 2 processes on my computer, which has 2 GPUs. I use the spawn function to start them.

Question 1:
How do I specify the rank of each process when I use the spawn function to start main_worker?

Question 2:
How do I specify or check the local_rank of each process in main_worker?

Question 3:
world_size is the total number of GPUs used across the processes.
rank is the index of each process (the process number).
local_rank is the index of the GPU used by that process (rank).
Is my understanding correct?

My test code is pasted as follows.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.optim as optim

def main_worker(proc, nprocs, args):
   # proc is the process index that mp.spawn passes in as the first argument
   dist.init_process_group(backend='nccl', init_method='tcp://', world_size=nprocs, rank=proc)

   train_dataset = ...
   train_sampler = ...

   train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

   model = ...
   model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

   criterion = ...
   optimizer = optim.SGD(model.parameters())

   for epoch in range(100):
      for batch_idx, (images, target) in enumerate(train_loader):
         images = images.cuda(non_blocking=True)
         target = target.cuda(non_blocking=True)
         output = model(images)
         loss = criterion(output, target)

if __name__ == '__main__':
   # main_worker has to be defined before it is handed to mp.spawn
   mp.spawn(main_worker, nprocs=2, args=(2, myargs))

The function you start with mp.spawn must take the process index as its first argument (proc in your example). mp.spawn fills it in for you, with values from 0 to nprocs - 1 in the order the processes are started, and on a single machine you can use it directly as the rank of the process.
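As a minimal sketch of that calling convention, assuming a single machine and the CPU-only gloo backend so that it runs without GPUs (the address 127.0.0.1 and port 29500 are placeholder choices, not values from the question):

```python
import torch.distributed as dist
import torch.multiprocessing as mp


def main_worker(proc, nprocs, args):
    # mp.spawn supplies `proc` itself: 0 for the first process, 1 for the second.
    dist.init_process_group(
        backend="gloo",                       # assumption: CPU backend for the demo
        init_method="tcp://127.0.0.1:29500",  # assumption: placeholder address and port
        world_size=nprocs,
        rank=proc,
    )
    print(f"spawn index {proc} -> rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    nprocs = 2
    # Only the arguments that come after the index go into `args`.
    mp.spawn(main_worker, nprocs=nprocs, args=(nprocs, None))
```

Note that `args=(nprocs, None)` has two entries while main_worker takes three parameters; the missing first one is the index that mp.spawn prepends.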

After initializing the process group (as you do with init_process_group), you can call dist.get_rank() to get the global rank of the process in the world. To get the local rank, assuming a homogeneous setup in which every machine has the same number of GPUs, take the result of dist.get_rank() modulo the number of GPUs on the machine.
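The modulo rule can be sketched in plain Python. This assumes every node has the same number of GPUs and that global ranks are numbered node by node; the helper names and the `gpus_per_node` parameter are hypothetical, not part of torch.distributed:

```python
def local_rank_of(global_rank: int, gpus_per_node: int) -> int:
    """Local rank of a process, assuming ranks are assigned node by node."""
    return global_rank % gpus_per_node


def node_index_of(global_rank: int, gpus_per_node: int) -> int:
    """Index of the node that hosts the given global rank."""
    return global_rank // gpus_per_node


# Two nodes with 2 GPUs each give global ranks 0..3.
for rank in range(4):
    print(f"global rank {rank}: node {node_index_of(rank, 2)}, "
          f"local rank {local_rank_of(rank, 2)}")
```

In a running job, `global_rank` would come from dist.get_rank() and `gpus_per_node` from torch.cuda.device_count().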

Your understanding of rank and local_rank is correct, though note that local_rank is not necessarily the GPU index. You can assign GPUs to ranks non-sequentially with the torch.cuda.set_device() API; for example, rank 0 could operate on GPU 1 and rank 1 on GPU 0. As for world_size, it is the total number of active processes across all nodes. Usually this equals the number of GPUs available.
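To make the distinction concrete, here is a small sketch of how world_size and a non-sequential rank-to-GPU mapping could be written down. The mapping table and helper names are made up for illustration; the only real API involved is torch.cuda.set_device(), which appears in a comment:

```python
def world_size_of(num_nodes: int, procs_per_node: int) -> int:
    """Total number of processes across all nodes."""
    return num_nodes * procs_per_node


# Hypothetical mapping for the example above: rank 0 uses GPU 1, rank 1 uses GPU 0.
GPU_FOR_LOCAL_RANK = {0: 1, 1: 0}


def gpu_index_for(local_rank: int) -> int:
    """GPU index that the process with this local rank should use."""
    return GPU_FOR_LOCAL_RANK[local_rank]


# Inside a worker one would then call:
#     torch.cuda.set_device(gpu_index_for(local_rank))
print(world_size_of(num_nodes=1, procs_per_node=2))
print([gpu_index_for(r) for r in (0, 1)])
```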

@rvarm1,
Many thanks for your reply!

One more question:
What is global_rank, and how do I get or set it in code?

@rvarm1,

I am trying to run PyTorch code from VS Code on Ubuntu, and I am running into the following issue:

Exception has occurred: ProcessRaisedException

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/", line 59, in _wrap
    fn(i, *args)
  File "/home/smb/code_python/pytorch-image-models/", line 523, in main
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/", line 520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/", line 199, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
  File "/home/smb/code_python/pytorch-image-models/", line 1089, in <module>
    mp.spawn(main, nprocs=2, args=(2, [333,444,555]))

I set the env like this:

"env": {
    "CUDA_VISIBLE_DEVICES": "6,7",
    "WORLD_SIZE": "2",
    "RANK": "0",
    "MASTER_ADDR": "",
    "MASTER_PORT": "44147"
}

mp.spawn(main, nprocs=2, args=(2, [333,444,555]))

args = parser.parse_args()

def main(process_index, process_cnt:int, args_spawn:list=[]):

	args.device = 'cuda:%d' % args.local_rank
	torch.distributed.init_process_group(backend='nccl', init_method='env://')
	args.world_size = torch.distributed.get_world_size()
	args.rank = torch.distributed.get_rank()
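
One possible cause of the "Address already in use" error, offered as a guess from the configuration shown rather than a confirmed diagnosis: both spawned processes inherit RANK=0 from the launch environment. With init_method='env://', the process with rank 0 is the one that opens the store on MASTER_PORT, so a second process that also believes it is rank 0 may fail to bind the same port. A port left occupied by an earlier run would produce the same message. A sketch of giving each process its own RANK from the index that mp.spawn passes in (the helper name rendezvous_env is hypothetical):

```python
import os


def rendezvous_env(process_index: int, process_cnt: int) -> dict:
    """Per-process values for the env:// rendezvous (hypothetical helper)."""
    return {
        "RANK": str(process_index),       # unique per process instead of a fixed "0"
        "WORLD_SIZE": str(process_cnt),
    }


def main(process_index, process_cnt, args_spawn=None):
    # Overwrite the inherited RANK before the process group is initialized.
    os.environ.update(rendezvous_env(process_index, process_cnt))
    # The original call would follow here:
    #     torch.distributed.init_process_group(backend='nccl', init_method='env://')
```

The device could be derived from process_index in the same way: with CUDA_VISIBLE_DEVICES set to "6,7", the two visible GPUs are numbered 0 and 1 inside the process.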