Hi!
I have recently been using torch elastic with the c10d rendezvous backend and min_nodes=1. I have successfully joined an existing training job from other nodes dynamically: the training process blocks for rendezvous and then restarts from the latest checkpoint with a new remaining iteration count (because of the updated world size), as expected.
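For reference, the resume logic in my script follows the usual elastic checkpoint pattern, roughly like the sketch below (simplified, with illustrative names, not my exact main.py):

import os
import torch
import torch.distributed as dist

def resume_or_start(model, optimizer, ckpt_path):
    # After each re-rendezvous, torch elastic restarts the worker script,
    # which resumes from the latest checkpoint if one exists.
    start_epoch = 0
    if os.path.isfile(ckpt_path):
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1
    return start_epoch

def save_checkpoint(model, optimizer, epoch, ckpt_path):
    # Only rank 0 writes the checkpoint; the other ranks pick it up
    # after the next restart.
    if dist.get_rank() == 0:
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            ckpt_path,
        )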
However, when I kill the process on the other node, the c10d node (the one hosting the rendezvous endpoint) also fails and the training is terminated. The error log (run with NCCL_DEBUG=INFO) is attached below:
ip-10-0-0-204:31012:31048 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
ip-10-0-0-204:31012:31048 [0] NCCL INFO transport/net_socket.cc:405 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO include/net.h:28 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO transport/net.cc:357 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
Traceback (most recent call last):
File "./main.py", line 603, in <module>
main()
File "./main.py", line 188, in main
train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
File "./main.py", line 471, in train
loss.backward()
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: NCCL communicator was aborted on rank 1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31012) of binary: /home/ubuntu/anaconda3/envs/pytorch_1.9_p37/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 4.040053606033325 seconds
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 889, in _exit_barrier
barrier_timeout=self._exit_barrier_timeout,
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
agent_data = get_all(store, key_prefix, world_size)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
data = store.get(f"{prefix}{idx}")
RuntimeError: Stop_waiting response is expected
Exception in thread RendezvousKeepAliveTimer_0:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255, in _run
ctx.function(*ctx.args, **ctx.kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1002, in _keep_alive_weak
self._keep_alive()
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1012, in _keep_alive
self._op_executor.run(op, deadline)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 546, in run
has_set = self._state_holder.sync()
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 376, in sync
get_response = self._backend.get_state()
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 63, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 103, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
MemoryError: std::bad_alloc
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'ip-10-0-0-204.us-west-2.compute.internal_30838_0' has failed to shutdown the rendezvous 'yzs123' due to an error of type RendezvousConnectionError.
/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 31012 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
)(*cmd_args)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
./main.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-10-31_06:11:31
rank: 1 (local_rank: 0)
exitcode: 1 (pid: 31012)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
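As a side note, the warning above suggests decorating the top-level entrypoint with torch.distributed.elastic.multiprocessing.errors.record so that a proper error file is written; I have not done that yet, but following the snippet in the log it would look like this:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # existing entrypoint in main.py
    ...

if __name__ == "__main__":
    main()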
I suppose this is not the expected behavior. Can anyone help based on this information? I am using PyTorch 1.9.1 with Python 3.7, installed from conda.
The training script is on Ubuntu Pastebin; it comes from the Docker image torchelastic/example:0.2.0 with minor modifications.
Launch script: NCCL_DEBUG=INFO python -m torch.distributed.run --nnodes=1:4 --nproc_per_node=1 --rdzv_id=xxxx --rdzv_backend=c10d --rdzv_endpoint=10.0.0.204:29400 ./main.py --arch resnet18 --epochs 20 --batch-size 32 --dist-backend nccl …/…/data/tiny-imagenet-200
(a copy of the Tiny ImageNet dataset is also included in the image)
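For completeness, the worker side initializes the process group from the environment variables that torch.distributed.run exports; a minimal sketch of that part (illustrative, not the exact code from the example image):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # torch.distributed.run sets LOCAL_RANK, RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT for every worker it launches.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # matches --dist-backend nccl
    return local_rank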