Hi Kiuk,
I use the command LOGLEVEL=INFO torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 main.py --if_name enp4s0 --batch_size 512 to start my script; the last two parameters are custom to my application (a sketch of how they might be consumed follows the log below). With this command, my training script runs fine in standalone mode. On this host, stdout prints the text below, including the events generated by the admission of the new host (after the SIGTERM):
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main.py
min_nodes : 1
max_nodes : 2
nproc_per_node : 1
run_id : 1
rdzv_backend : c10d
rdzv_endpoint : slave6:1234
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_2bl7vzba/1_8s8bqrfc
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=slave6
master_port=57905
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2bl7vzba/1_8s8bqrfc/attempt_0/0/error.json
Files already downloaded and verified
Files already downloaded and verified
INFO:torch.distributed.elastic.agent.server.api:[default] Detected 1 new nodes from group_rank=0; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1114254 closing signal SIGTERM
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=slave6
master_port=55107
group_rank=0
group_world_size=2
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[2]
global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2bl7vzba/1_8s8bqrfc/attempt_0/0/error.json
Files already downloaded and verified
Files already downloaded and verified
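For context, here is a minimal sketch of how the two custom flags can be consumed at the top of a script like main.py. This is not my actual script; in particular, mapping --if_name to NCCL_SOCKET_IFNAME is an assumption, shown only so the command above is self-explanatory:

import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--if_name", type=str, default="enp4s0")   # NIC to use for collective traffic
parser.add_argument("--batch_size", type=int, default=512)
args = parser.parse_args()

# Assumption: pin NCCL to the requested network interface before process-group init.
os.environ.setdefault("NCCL_SOCKET_IFNAME", args.if_name)

# torchrun exports LOCAL_RANK for every worker process it spawns.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")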
On another host, I enter the same command: LOGLEVEL=INFO torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 main.py --if_name enp4s0 --batch_size 512, and its stdout is listed below:
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main.py
min_nodes : 1
max_nodes : 2
nproc_per_node : 1
run_id : 1
rdzv_backend : c10d
rdzv_endpoint : slave6:1234
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x_83j4wn/1_wimfrwjn
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=slave6
master_port=55107
group_rank=1
group_world_size=2
local_ranks=[0]
role_ranks=[1]
global_ranks=[1]
role_world_sizes=[2]
global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x_83j4wn/1_wimfrwjn/attempt_0/0/error.json
Files already downloaded and verified
Files already downloaded and verified
On this host, the exception thrown by the application is logged as follows:
05/03/2022 10:11:51 AM - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
05/03/2022 10:11:51 AM - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/03/2022 10:11:55 AM - ERROR - NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:125, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
Traceback (most recent call last):
File "main.py", line 74, in <module>
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 641, in __init__
dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:125, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
It looks strange that the log on the first node doesn't record any exceptions.
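If more verbose output would help, I can rerun the same command with NCCL's standard debug variables enabled and post the result (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL environment variables; nothing else below is changed):

LOGLEVEL=INFO NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL \
torchrun --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 \
  --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=slave6:1234 \
  main.py --if_name enp4s0 --batch_size 512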
Thanks for your patient help.