I submit a batch file. Inside the batch file, I can get the names of all nodes, which are assigned by the server.
The node names change with each submission.
I also have another question. When I run with real (large) data, the gloo backend stops after 60 seconds in my program.
How do I set the timeout for this situation? The program will run for 20-30 minutes.
In that case, I wonder if it would be possible to programmatically figure out the master. E.g., have all processes sort the node names and then always use the first one?
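A minimal sketch of that idea, assuming every process can see the same list of node names (for example, from the batch file). The helper `pick_master` and the node names are hypothetical, made up for illustration. Because every process sorts the same names, they all agree on the same master without communicating.

```python
import os


def pick_master(node_names):
    """Return the lexicographically smallest unique node name."""
    # Strip whitespace, drop empty entries, and de-duplicate before sorting.
    names = sorted({name.strip() for name in node_names if name.strip()})
    if not names:
        raise ValueError("no node names given")
    return names[0]


# Hypothetical node names; a real job would read them from the scheduler.
nodes = ["node07", "node02", "node11", "node02"]
master = pick_master(nodes)

# Every process computes the same value, so all of them can set it locally.
os.environ["MASTER_ADDR"] = master
print(master)
```

The sort is lexicographic, so "node10" comes before "node2". That does not matter here, since the only requirement is that all processes pick the same name.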
How do I set the timeout for this situation? The program will run for 20-30 minutes.
That per-RPC timeout is only for v1.6, which hasn’t been released yet. The feature is available on master now, but you will need to either use a nightly binary or build from source to get it. If you are using v1.4 or v1.5, please try the other two options mentioned above.
What does the following code print in your environment?
import datetime
import os

from torch.distributed import rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

rpc.init_rpc(
    "worker",
    rank=0,
    world_size=1,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        # note that this will change to float type to support TorchScript integration.
        rpc_timeout=datetime.timedelta(seconds=1000),
    ),
)
rpc.shutdown()