DDP Error: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group

Hello,

I’m trying to implement DDP for training on 2 or more nodes with just 1 GPU per node, and I’ve run into the following issue:

if __name__ == "__main__":
    setup(rank=args.rank, world_size=args.nodes, args=args)
    mp.spawn(train, nprocs=2, args=(args,))

where the setup function is:

def setup(rank, world_size, args):
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    os.environ['RANK'] = str(args.rank)
    os.environ['WORLD_SIZE'] = str(args.nodes)
    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

In train I initialize the model and run the training loop. I get the following error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

But how? I initialized the process group one line above.
Thanks!

Correct me if I am wrong, but it looks like you first init the process group and then spawn the processes?
If so, you might want to do it the other way around. You could also add a log line after init_process_group: do you see it finish for each rank?
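
For example, something like this in your existing setup (just a sketch; the print stands in for whatever logging you use):

import os
import torch.distributed as dist

def setup(rank, world_size, args):
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # if this never appears for some rank, init_process_group is hanging
    # or was never reached in that process
    print(f"[rank {rank}] init_process_group finished", flush=True)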

If I do it like this, spawning first and calling init_process_group after:

if __name__ == "__main__":
    mp.spawn(train, nprocs=1, args=(args,))
    setup(rank=args.rank, world_size=args.nodes, args=args)

I have the following error:

Traceback (most recent call last):
   line 233, in <module>
    mp.spawn(train, nprocs=1, args=(args, ))
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "...........path/train_ddp_example.py", line 70, in train
    sampler_train, train_loader = load_dataset(args)
  File "...........path/train_ddp_example.py", line 54, in load_dataset
    sampler_train = DistributedSampler(train_set)
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/utils/data/distributed.py", line 67, in __init__
    num_replicas = dist.get_world_size()
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/home/user/anaconda3/envs/p3m/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Any ideas?

Hi @banarutz
You need to call init_process_group in each spawned process. mp.spawn starts fresh child processes (and blocks until they exit), so a group initialized in the parent, whether before or after the spawn call, is never visible inside train; that is why DistributedSampler’s internal dist.get_world_size() call fails.

That is,

def main(rank, args):
    # mp.spawn passes the process index as the first argument
    setup(rank=rank, world_size=args.nodes, args=args)
    train(rank, args)

if __name__ == "__main__":
    mp.spawn(main, nprocs=1, args=(args,))
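
One detail worth noting for the multi-node case: the first argument mp.spawn passes is the local process index on that node, not the global rank. With one process per node they happen to coincide with the node index, but it is worth computing the global rank explicitly, and tearing the group down when training ends. A sketch reusing setup and train from the snippets above (args.rank is assumed to be the node's index):

import torch.distributed as dist
import torch.multiprocessing as mp

def main(local_rank, args):
    # with nprocs=1 the local index is always 0, so the global rank
    # reduces to this node's index (args.rank in the original script)
    rank = args.rank + local_rank
    setup(rank=rank, world_size=args.nodes, args=args)
    train(rank, args)
    dist.destroy_process_group()  # clean shutdown once training is done

if __name__ == "__main__":
    mp.spawn(main, nprocs=1, args=(args,))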

Here’s a tutorial where I explain more about structuring your script to use DDP with torch.multiprocessing: Multi GPU training with DDP — PyTorch Tutorials 1.13.0+cu117 documentation

Alternatively, you can use torchrun for a simpler structure and automatic setting of env variables. See this tutorial: Fault-tolerant Distributed Training with torchrun — PyTorch Tutorials 1.13.0+cu117 documentation
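
With torchrun the script no longer spawns processes itself; each worker just reads the environment variables torchrun sets. A minimal sketch of what that version might look like (the script name is taken from your traceback):

import torch.distributed as dist

def main():
    # torchrun launches the workers and sets RANK, WORLD_SIZE,
    # MASTER_ADDR, and MASTER_PORT, so there is no mp.spawn and no
    # manual env vars: init_process_group reads them from the environment
    dist.init_process_group("nccl")
    # ... build the model, wrap it in DDP, run the training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched once per node, e.g. for your two-node, one-GPU-per-node setup:

torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<node index> --master_addr=<master ip> --master_port=<port> train_ddp_example.py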
