Multiple nodes with PyTorch (CPUs only)

Hi,

For single node, I set

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

and the size is as input parameter.

However, with multiple nodes these have to be set differently, and I do not know how to set them.

For example, with 4 nodes I know the node names, as below:

C1-01
C1-02
C2-01
C2-02

Each time I submit the job, the node names change.
How should I set MASTER_ADDR in the program?

Thanks,

Hey @ph0123, do you mean that before submitting jobs, neither the node names nor the node IP addresses are known to you?

Hi,

I submit a batch file. Inside the batch script, I can get the names of all the nodes assigned by the server.
The node names change with each submission.

I also have another question. When I run with real (large) data, the Gloo backend stops after 60 seconds in my program.
How can I set the timeout for this situation? The program needs to run for 20-30 minutes.

Thanks,

In that case, I wonder if it would be possible to programmatically figure out the master. E.g., ask all processes to sort all node names and then always use the first one?
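The idea above can be sketched as follows. This assumes the batch script makes the assigned node names available to every process, e.g. through a hypothetical NODE_LIST environment variable; the variable name and the comma-separated format are assumptions, not part of any PyTorch API.

```python
import os

def pick_master(node_list_csv):
    """Deterministically pick the master: sort the node names and take the first.

    Every process computes the same result, so they all agree on the master
    without any communication.
    """
    nodes = sorted(node_list_csv.split(','))
    return nodes[0]

# Hypothetical: the batch script exports the assigned nodes, e.g.
# NODE_LIST="C2-01,C1-02,C2-02,C1-01". The default here mirrors the example
# node names from this thread.
master = pick_master(os.environ.get('NODE_LIST', 'C1-01,C1-02,C2-01,C2-02'))
os.environ['MASTER_ADDR'] = master   # same value on every node
os.environ['MASTER_PORT'] = '29500'
```

Since sorting is deterministic, the order in which the scheduler lists the nodes does not matter.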

How can I set the timeout for this situation? The program needs to run for 20-30 minutes.

Are you using RPC or DDP? For RPC, check out this PR (https://github.com/pytorch/pytorch/pull/38577). For DDP, init_process_group does take a timeout argument.
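For the DDP route, a minimal single-process sketch of passing that timeout argument might look like this; the backend, rank, world size, and address values are placeholders for illustration.

```python
import datetime
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process run.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

# init_process_group accepts a `timeout` as a datetime.timedelta;
# here it is raised to cover a long-running job.
dist.init_process_group(
    backend='gloo',
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(minutes=30),
)
initialized = dist.is_initialized()
dist.destroy_process_group()
```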


The 60 second timeout might be the default RPC timeout. Which version of PyTorch are you using?

In v1.4, there is a hidden _set_rpc_timeout API.

In v1.5, you can customize the ProcessGroupRpcBackendOptions to provide a timeout.


We will be adding per-RPC timeout in v1.6. See

  1. https://github.com/pytorch/pytorch/issues/32686
  2. https://github.com/pytorch/pytorch/issues/36000

Thanks,
For multiple nodes, I think I can print the node names to a file; the first node name will be the master node.
Let me try the timeout.


Hi,
I tried to set the timeout as in the GitHub issue: https://github.com/pytorch/pytorch/issues/32686


rpc.rpc_async(my_target, add_outlinks, args=(arr_send[i], source), timeout=None)

The error:

TypeError: rpc_async() got an unexpected keyword argument 'timeout'

Perhaps rpc_async does not have a "timeout" parameter.

Thanks,

That per-RPC timeout is only available in v1.6, which hasn't been released yet. The feature is on master now, though, but you will need to either use a nightly binary or build from source to get it. If you are using v1.4 or v1.5, please try the other two options mentioned above.

What does the following code print in your environment?

import torch
torch.__version__

I am using PyTorch v1.5.
Thanks,

Cool, then below should be the way to go. The link below points to the doc that contains an example.

In v1.5, you can customize the ProcessGroupRpcBackendOptions to provide a timeout.

Yes,

rpc.init_rpc(my_name, rank=rank, world_size=size, rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=16,datetime.timedelta(seconds=1000)))  # initial_rpc

Output:

    rpc.init_rpc(my_name, rank=rank, world_size=size, rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=16,datetime.timedelta(seconds=1000)))  # initial_rpc
                                                                                                                                    ^
SyntaxError: positional argument follows keyword argument

Thanks for reporting; it's an error in the doc. I think it needs to be:

rpc.ProcessGroupRpcBackendOptions(
    num_send_recv_threads=16,
    rpc_timeout=datetime.timedelta(seconds=1000)
)

Let me try.

import datetime, os
from torch.distributed import rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

rpc.init_rpc(
    "worker",
    rank=0,
    world_size=1,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=1000)  # note that this will change to float type to support TorchScript integration.
    )
)

rpc.shutdown()

Yep, above should work. Will submit a PR to fix.
