I have a problem with init_process_group. With this call:
mp.spawn(init_distributed, args=(hparams, n_gpus, group_name,), nprocs=n_gpus, join=True)
and this function:
def init_distributed(rank, hparams, n_gpus, group_name):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '54321'
    print(f"{rank=} init complete")
    dist.init_process_group(
        backend=hparams.dist_backend,
        init_method=hparams.dist_url,
        world_size=n_gpus,
        rank=rank,
        group_name=group_name,
    )
    print("Done initializing distributed")
I get:
FP16 Run: True
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Initializing Distributed
rank=1 init complete
rank=0 init complete
Done initializing distributed
Done initializing distributed
followed by this error:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
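One possible explanation, stated as an assumption rather than a confirmed diagnosis: the process group created by init_process_group exists only inside the process that called it. If the code that raises the error runs in the parent process after mp.spawn returns, that process never had a group. The sketch below illustrates the pattern where the group is both created and used inside the spawned function. The names worker, launch and find_free_port are hypothetical, and the gloo backend and tcp:// address are chosen only so the sketch runs on CPU; none of them come from the script above.

```python
import socket

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank: int, world_size: int, port: int) -> float:
    """Initialize the process group and use it inside the same process."""
    dist.init_process_group(
        backend="gloo",  # assumption: CPU backend so the sketch needs no GPU
        init_method=f"tcp://127.0.0.1:{port}",
        world_size=world_size,
        rank=rank,
    )
    try:
        # Stand-in for the training loop: every rank contributes 1.0.
        t = torch.ones(1)
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        return t.item()  # equals world_size on every rank
    finally:
        dist.destroy_process_group()


def launch(world_size: int = 2) -> None:
    """Spawn one process per rank; each runs worker() from start to finish."""
    port = find_free_port()
    # mp.spawn calls worker(rank, world_size, port) in each child process.
    mp.spawn(worker, args=(world_size, port), nprocs=world_size, join=True)
    # Back in the parent: no process group was created in this process.
    assert not dist.is_initialized()
```

Calling launch(2) under an `if __name__ == "__main__":` guard starts two workers. If this is what happens in my script, the corresponding change would be to pass the whole training function to mp.spawn instead of only init_distributed.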
I am using an RTX 4090 and an RTX 3080 Ti.
Can someone please help me?