What is the difference among these elastic launchers?

Hi,

I found that there are many launchers that can be used to start elastic distributed training, e.g., torchelastic, torchrun, and torch.distributed.run. Which one is recommended now? Is a training script that works with A also applicable to B?

Thanks

@Kiuk_Chung can you please help here?

They all use torch.distributed.run under the hood, which uses torchelastic.

torchrun is effectively equivalent to torch.distributed.run, but it is a “console script” (Command Line Scripts — Python Packaging Tutorial) that we include for convenience so that you don’t have to run python -m torch.distributed.run every time and can simply invoke torchrun <same arguments>.
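
For example, these two invocations should behave the same (the script name and its arguments here are placeholders, not from this thread):

# console script
torchrun --nnodes=1 --nproc_per_node=4 train.py --lr 0.01
# equivalent module invocation
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 train.py --lr 0.01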

torch.distributed.launch is now on the path to deprecation and internally calls torch.distributed.run. For backwards compatibility it has a more restrictive set of options and a few option remappings compared to torch.distributed.run.
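
One practical difference to be aware of when migrating (a general note, not something specific to this thread): torch.distributed.launch passes --local_rank=<rank> to your script on the command line by default (unless --use_env is given), whereas torchrun only sets the LOCAL_RANK environment variable, so the script should read the rank from the environment. Roughly:

# deprecated launcher: each worker is started with an extra --local_rank=<rank> argument
python -m torch.distributed.launch --nproc_per_node=4 train.py
# torchrun: no --local_rank argument is added; read LOCAL_RANK from the environment instead
torchrun --nproc_per_node=4 train.py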

We’ve updated our nightly documentation to explain this situation: torchrun (Elastic Launch) — PyTorch 1.12 documentation


Hi, thanks for your response. I found that I cannot dynamically scale out the number of workers in a job (from 2 to 3). Is this an inherent constraint of PyTorch?

Torchelastic should handle admission and departure of nodes. How are you running the min=2 max=3 job?
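
For reference, an elastic job with min=2 and max=3 nodes is typically launched on every node with something like the following (the rendezvous host/port, job id, and script are placeholders):

torchrun --nnodes=2:3 --nproc_per_node=1 --rdzv_id=job1 --rdzv_backend=c10d --rdzv_endpoint=host0:29400 main.py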

Hi Kiuk,

I am sorry that I didn’t explain my problem well. I can use the torchrun script to start training on one host; specifically, I start it with: torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 main.py. In standalone mode, this runs well. However, when I start this script on another host, i.e. slave7, it fails. The log of the new node I added is below:

05/02/2022 23:18:20 PM - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
05/02/2022 23:18:20 PM - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/02/2022 23:18:24 PM - ERROR - NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:125, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
Traceback (most recent call last):
  File "main.py", line 74, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:125, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

So I fail to dynamically add workers. How can I find out what is going wrong?

Thanks for your patient response.

The usage docs (torchrun (Elastic Launch) — PyTorch 1.11.0 documentation) have examples for different use cases.

etcd is only required if:

  1. you need a high degree of fault tolerance (a.k.a. node 0 fault tolerance). By default, rdzv_backend=c10d will create a data plane on node 0, so if node 0 dies, your job cannot recover and has to be retried. Using an external etcd store prevents this, but the probability of node 0 failing is also pretty low. Ultimately it is your decision whether you want to deal with setting up an external store for a higher degree of fault tolerance, or prefer simplicity.
  2. you don’t have the IP/hostname of “one” of the nodes up front. In this case you use the etcd URL (which you will know up front since you set up an etcd cluster somewhere); see the sketch after this list.
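
If you do go the etcd route, only the rendezvous flags change, roughly like this (the etcd host/port are placeholders, and the exact backend name, etcd vs. etcd-v2, depends on your PyTorch version, so check the rendezvous docs):

torchrun --nnodes=1:2 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=etcd-v2 --rdzv_endpoint=etcd-host:2379 main.py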

For the particular issue you are having with the two-node setup, could you copy-paste the exact command you ran on each node? Also, could you set the env var LOGLEVEL=INFO when running torchrun (e.g. $ LOGLEVEL=INFO torchrun ...) and copy-paste the outputs from each node?

Hi Kiuk,

I use the command LOGLEVEL=INFO torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 main.py --if_name enp4s0 --batch_size 512 to start my script, where the last two parameters are specific to my application. With this command, my training script runs well in standalone mode. On this host, stdout prints the text below, including the events generated by the admission of the new host (after the SIGTERM):

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main.py
min_nodes : 1
max_nodes : 2
nproc_per_node : 1
run_id : 1
rdzv_backend : c10d
rdzv_endpoint : slave6:1234
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_2bl7vzba/1_8s8bqrfc
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=slave6
master_port=57905
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2bl7vzba/1_8s8bqrfc/attempt_0/0/error.json
Files already downloaded and verified
Files already downloaded and verified
INFO:torch.distributed.elastic.agent.server.api:[default] Detected 1 new nodes from group_rank=0; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1114254 closing signal SIGTERM
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=slave6
master_port=55107
group_rank=0
group_world_size=2
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[2]
global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2bl7vzba/1_8s8bqrfc/attempt_0/0/error.json
Files already downloaded and verified
Files already downloaded and verified

On the other host, I enter the same command, LOGLEVEL=INFO torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 main.py --if_name enp4s0 --batch_size 512, and the stdout is listed below:

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main.py
min_nodes : 1
max_nodes : 2
nproc_per_node : 1
run_id : 1
rdzv_backend : c10d
rdzv_endpoint : slave6:1234
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x_83j4wn/1_wimfrwjn
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=slave6
master_port=55107
group_rank=1
group_world_size=2
local_ranks=[0]
role_ranks=[1]
global_ranks=[1]
role_world_sizes=[2]
global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x_83j4wn/1_wimfrwjn/attempt_0/0/error.json
Files already downloaded and verified
Files already downloaded and verified

On this host, the exception thrown by the application is logged below:

05/03/2022 10:11:51 AM - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
05/03/2022 10:11:51 AM - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/03/2022 10:11:55 AM - ERROR - NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:125, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
Traceback (most recent call last):
  File "main.py", line 74, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:125, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

It looks strange that the log on the first node doesn’t record any exceptions.

Thanks for your patient help.

Could you run this as a single-node job on the host that throws the error? e.g.

# assuming the host has two GPUs; if not, use --nproc_per_node=1
LOGLEVEL=INFO torchrun --nnodes=1 --nproc_per_node=2  --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 main.py --if_name enp4s0 --batch_size 512
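
If the single-node run works but the two-node run still hits the NCCL internal error, it may also help to turn up NCCL's own logging and pin the network interface it should use. This is generic NCCL debugging rather than a torchrun-specific fix; NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables, and enp4s0 is taken from your --if_name flag:

# print detailed NCCL init/transport logs and force NCCL onto the enp4s0 interface
NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=enp4s0 LOGLEVEL=INFO torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 main.py --if_name enp4s0 --batch_size 512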

Hi Kiuk,

I ran this command on one host, and no error occurred; the 2-worker training started successfully.

Thanks