Multiple nodes with PyTorch (CPUs only)

Hi,

For single node, I set

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

and the size is as input parameter.

However, with multiple nodes these have to be set differently, and I do not know how to set them.

For example, with 4 nodes I know the node names, as below:

C1-01
C1-02
C2-01
C2-02

Each time I submit the job, the node names change.
How should I set MASTER_ADDR in the program?

Thanks,

Hey @ph0123, do you mean that before submitting jobs, neither the node names nor the node IP addresses are known to you?

Hi,

I submit a batch file. Inside the batch script, I can get the names of all the nodes assigned by the server.
The node names change with each submission.

I also have another question. When I run with real (large) data, the Gloo backend stops after 60 seconds in my program.
How can I set the timeout for this situation? The program needs to run for 20-30 minutes.

Thanks,

In that case, I wonder if it would be possible to programmatically figure out the master. E.g., ask all processes to sort all node names and then always use the first one?
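The idea above can be sketched as follows. This assumes the batch script makes the assigned node names available to every process, e.g. through a hypothetical NODE_LIST environment variable; the variable name and the comma-separated format are assumptions, not part of any PyTorch API.

```python
import os

def pick_master(node_list_csv):
    """Deterministically pick the master: sort the node names and take the first.

    Every process computes the same result, so they all agree on the master
    without any communication.
    """
    nodes = sorted(node_list_csv.split(','))
    return nodes[0]

# Hypothetical: the batch script exports the assigned nodes, e.g.
# NODE_LIST="C2-01,C1-02,C2-02,C1-01". The default here mirrors the example
# node names from this thread.
master = pick_master(os.environ.get('NODE_LIST', 'C1-01,C1-02,C2-01,C2-02'))
os.environ['MASTER_ADDR'] = master   # same value on every node
os.environ['MASTER_PORT'] = '29500'
```

Since sorting is deterministic, the order in which the scheduler lists the nodes does not matter.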

How can I set the timeout for this situation? The program needs to run for 20-30 minutes.

Are you using RPC or DDP? For RPC, check out this PR (https://github.com/pytorch/pytorch/pull/38577). For DDP, init_process_group does take a timeout argument.
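For the DDP route, a minimal single-process sketch of passing that timeout argument might look like this; the backend, rank, world size, and address values are placeholders for illustration.

```python
import datetime
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process run.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

# init_process_group accepts a `timeout` as a datetime.timedelta;
# here it is raised to cover a long-running job.
dist.init_process_group(
    backend='gloo',
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(minutes=30),
)
initialized = dist.is_initialized()
dist.destroy_process_group()
```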


The 60 second timeout might be the default RPC timeout. Which version of PyTorch are you using?

In v1.4, there is a hidden _set_rpc_timeout API.

In v1.5, you can customize the ProcessGroupRpcBackendOptions to provide a timeout.


We will be adding per-RPC timeout in v1.6. See

  1. https://github.com/pytorch/pytorch/issues/32686
  2. https://github.com/pytorch/pytorch/issues/36000

Thanks,
For multiple nodes, I think I can print the node names to a file; the first node name will be the master node.
Let me try the timeout.


Hi,
I tried to set the timeout as in the GitHub issue: https://github.com/pytorch/pytorch/issues/32686


rpc.rpc_async(my_target, add_outlinks, args=(arr_send[i], source), timeout=None)

The error:

TypeError: rpc_async() got an unexpected keyword argument 'timeout'

Perhaps rpc_async does not have a "timeout" parameter.

Thanks,

That per-RPC timeout is only available in v1.6, which hasn't been released yet. The feature is on master now, though, but you will need to either use a nightly binary or build from source to get it. If you are using v1.4 or v1.5, please try the other two options mentioned above.

What does the following code print in your environment?

import torch
torch.__version__

I am using PyTorch v1.5.
Thanks,

Cool, then below should be the way to go. The link below points to the doc that contains an example.

In v1.5, you can customize the ProcessGroupRpcBackendOptions to provide a timeout.

Yes,

rpc.init_rpc(my_name, rank=rank, world_size=size, rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=16,datetime.timedelta(seconds=1000)))  # initial_rpc

Output:

    rpc.init_rpc(my_name, rank=rank, world_size=size, rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=16,datetime.timedelta(seconds=1000)))  # initial_rpc
                                                                                                                                    ^
SyntaxError: positional argument follows keyword argument

Thanks for reporting; it's an error in the doc. I think it needs to be:

rpc.ProcessGroupRpcBackendOptions(
    num_send_recv_threads=16,
    rpc_timeout=datetime.timedelta(seconds=1000)
)

Let me try.

import datetime, os
from torch.distributed import rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

rpc.init_rpc(
    "worker",
    rank=0,
    world_size=1,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=1000)  # note that this will change to float type to support TorchScript integration.
    )
)

rpc.shutdown()

Yep, above should work. Will submit a PR to fix.
