Why does init_process_group hang when world_size > 1?

Hello! I need your help. I am trying to run inference of a UNet model on two Tesla P40 GPUs with 24 GB each. The code to initialize the distributed environment in PyTorch (torch==1.13.1) is:

import …

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ['NCCL_P2P_DISABLE'] = '1'
os.environ['NCCL_P2P_LEVEL'] = 'LOC'

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = str(find_free_port())  # I'm not sure whether I should hard-code a port or whether picking a new free port each time works

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def execution(rank):
    # ...load the model from the .pth file

    setup(rank, 2)

    torch.cuda.set_device(0)

    model = FSDP(model)
    model.eval()

    # ...run inference with the model

def main():
    WORLD_SIZE = torch.cuda.device_count()

    mp.spawn(execution,  # this line hangs!
             args=(),
             nprocs=2,
             join=True)

if __name__ == '__main__':
    main()

I based my code on the example from this page: Getting Started with Fully Sharded Data Parallel (FSDP) — PyTorch Tutorials 2.3.0+cu121 documentation, but adapted it for inference only.

I looked at similar examples in the forum, but none of the proposed solutions fixed the problem. I don't know whether it is an issue with the GPU architecture of my machine.

How can I fix the problem?

Do you know which part of execution is hanging?

If you are using FSDP just for inference, then I assume your model is large and you just need to split it. Have you considered using pipeline parallelism instead? We are going to release the API in a prototype state in the coming weeks (pytorch/torch/distributed/pipelining at main · pytorch/pytorch · GitHub).

Yes, the part of execution that hangs is the dist.init_process_group line in the setup function. I can try the pipelining API you mentioned, but I still wonder why I can't initialize the process group. When I set world_size to 1, it does run.

Oh I see, yeah, the issue is this line. The port needs to be the same across processes, so each rank should use the same value. MASTER_ADDR and MASTER_PORT specify the address at which all processes rendezvous, so all ranks need to agree on them to synchronize.
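For instance, here is a minimal sketch of that pattern: the parent picks a free port once and hands the same value to every spawned rank (the worker function and port handling below are illustrative only, and assume two GPUs are visible for the NCCL backend):

import os
import socket
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, port):
    # identical MASTER_ADDR/MASTER_PORT on every rank, so all processes rendezvous at the same store
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = str(port)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    print(f"rank {rank} joined a group of {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == '__main__':
    # pick the port once in the parent process, then pass the same value to every worker
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        port = s.getsockname()[1]
    mp.spawn(worker, args=(2, port), nprocs=2, join=True)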

Hello, this is the updated code. It still doesn't work for me; I explain the errors in comments on the lines where they occur.

def setup(rank, world_size, port):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = str(port)  # I'm not sure whether I should hard-code a port or whether picking a new free port each time works

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def execution(rank, port):
    # ...load the model from the .pth file

    setup(rank, 2, port)

    torch.cuda.set_device(0)

    model = FSDP(model)
    model.eval()

    # ...run inference with the model

def main():
    WORLD_SIZE = torch.cuda.device_count()

    port = find_free_port()  # new line

    mp.spawn(execution,
             args=(port,),  # line updated
             nprocs=2,  # with 2 I get the error "Duplicate GPU detected: rank 1 and rank 0 both on CUDA device 3b000"; when I set it to 1, the code hangs
             join=True)

if __name__ == '__main__':
    main()
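The "Duplicate GPU detected" error typically means both ranks ended up on the same physical GPU, which is consistent with both processes calling torch.cuda.set_device(0). Below is a minimal sketch of the per-rank device pattern used in the FSDP tutorial, with a small nn.Linear standing in for the UNet and the spawn index used directly as the local GPU index (illustrative assumptions, not the original code):

import os
import socket
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def execution(rank, world_size, port):
    # same rendezvous address on every rank
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = str(port)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)             # rank 0 -> cuda:0, rank 1 -> cuda:1

    model = nn.Linear(16, 16).to(rank)      # stand-in for the UNet loaded from the .pth file
    model = FSDP(model)                     # sharded onto this rank's own GPU
    model.eval()

    with torch.no_grad():
        out = model(torch.randn(4, 16, device=f"cuda:{rank}"))
    print(f"rank {rank}: output shape {tuple(out.shape)}")

    dist.destroy_process_group()

if __name__ == '__main__':
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        port = s.getsockname()[1]           # chosen once, shared by all ranks
    world_size = 2                          # one process per visible GPU
    mp.spawn(execution, args=(world_size, port), nprocs=world_size, join=True)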