PyTorch distributed example code hangs. Deadlock?

Hi, thanks for your replies!

I keep getting this warning:

[W ProcessGroupGloo.cpp:558] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())

I am unsure why.
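
Reading the warning, it seems Gloo can't resolve the hostname and wants to be told explicitly which network interface to bind to. I think something like this would pin it (just a sketch; 'eth0' is a placeholder for whatever `ip addr` or `ifconfig` shows on your machine), set before `init_process_group` in every process:

    import os

    # Tell Gloo which network interface to bind to instead of falling
    # back to the loopback address. Must be set in every process
    # BEFORE dist.init_process_group() is called.
    # 'eth0' is a placeholder; substitute your actual interface name.
    os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'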

My setup is this:

        # set up the master's ip address so this child process can coordinate
        # os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'

        # use NCCL if you are using GPUs, otherwise fall back to Gloo:
        # https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
        backend = 'nccl' if torch.cuda.is_available() else 'gloo'
        # Initializes the default distributed process group, and this will
        # also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)

Does this look wrong to you? Is your main suggestion to change the port number?
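
For completeness, here is the minimal self-contained version I'd test with (a sketch; it spawns two processes over Gloo unless CUDA is available, and runs a single all_reduce just to confirm the group actually communicates rather than hanging):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        # Rendezvous info must be set in every spawned process.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        backend = 'nccl' if torch.cuda.is_available() else 'gloo'
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        # NCCL requires GPU tensors; Gloo works with CPU tensors.
        device = torch.device(f'cuda:{rank}') if backend == 'nccl' else torch.device('cpu')
        t = torch.ones(1, device=device)
        dist.all_reduce(t)  # every rank should end up with world_size
        print(f'rank {rank}: {t.item()}')
        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = 2
        mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)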