Training process is terminated when a node fails with torch elastic

Hi!

I have recently been using torch elastic with the c10d backend and min_nodes=1. I have succeeded in joining an existing training job from other nodes dynamically: the training process blocks for rendezvous and restarts from the latest checkpoint with a new remaining iteration count (because of the updated world size), as expected.
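For reference, the restart bookkeeping in my script is roughly the sketch below (resume_state, per_gpu_batch, dataset_len and the "step" key are placeholder names for illustration, not the actual code in main.py):

import torch
import torch.distributed as dist

def resume_state(ckpt_path, per_gpu_batch, dataset_len):
    # Restore the latest checkpoint and recompute the remaining number of
    # iterations for the (possibly changed) world size after re-rendezvous.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    world_size = dist.get_world_size()            # reflects the new node count
    steps_per_epoch = dataset_len // (per_gpu_batch * world_size)
    remaining = steps_per_epoch - ckpt["step"] % steps_per_epoch
    return ckpt, remaining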

However, when I kill the process on the other node, the c10d node (the one hosting the rendezvous) also fails and the training is terminated. The error log with NCCL info is attached below:

ip-10-0-0-204:31012:31048 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
ip-10-0-0-204:31012:31048 [0] NCCL INFO transport/net_socket.cc:405 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO include/net.h:28 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO transport/net.cc:357 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
Traceback (most recent call last):
  File "./main.py", line 603, in <module>
    main()
  File "./main.py", line 188, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "./main.py", line 471, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: NCCL communicator was aborted on rank 1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31012) of binary: /home/ubuntu/anaconda3/envs/pytorch_1.9_p37/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 4.040053606033325 seconds
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 889, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Stop_waiting response is expected
Exception in thread RendezvousKeepAliveTimer_0:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255, in _run
    ctx.function(*ctx.args, **ctx.kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1002, in _keep_alive_weak
    self._keep_alive()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1012, in _keep_alive
    self._op_executor.run(op, deadline)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 546, in run
    has_set = self._state_holder.sync()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 376, in sync
    get_response = self._backend.get_state()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 63, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 103, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
MemoryError: std::bad_alloc

WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'ip-10-0-0-204.us-west-2.compute.internal_30838_0' has failed to shutdown the rendezvous 'yzs123' due to an error of type RendezvousConnectionError.
/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 31012 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
            ./main.py FAILED           
=======================================
Root Cause:
[0]:
  time: 2021-10-31_06:11:31
  rank: 1 (local_rank: 0)
  exitcode: 1 (pid: 31012)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

I suppose this is not the expected behavior. Any help based on this information? I am using PyTorch 1.9.1 with Python 3.7, installed from conda.

The training script is on Ubuntu Pastebin; it comes from the Docker image torchelastic/example:0.2.0 with minor modifications.
Launch script: NCCL_DEBUG=INFO python -m torch.distributed.run --nnodes=1:4 --nproc_per_node=1 --rdzv_id=xxxx --rdzv_backend=c10d --rdzv_endpoint=10.0.0.204:29400 ./main.py --arch resnet18 --epochs 20 --batch-size 32 --dist-backend nccl …/…/data/tiny-imagenet-200
(a copy of the Tiny ImageNet dataset is also included in the image)

Hey @Kiuk_Chung, is this behavior expected?

Hmm, how do you kill the agent? Using Ctrl+C? If so, can you try killing it by sending a SIGTERM instead? This might be related to [distributed elastic] How to tolerate agent failures with etcd rendezvous backend? · Issue #67616 · pytorch/pytorch · GitHub
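For example, from another shell on the node, something along these lines should work (assuming the agent was launched via python -m torch.distributed.run and pgrep is available; this is just one way to find the PID):

import os
import signal
import subprocess

# Locate the elastic agent process and send SIGTERM instead of using Ctrl+C
# (Ctrl+C delivers SIGINT, which is the case discussed below).
for pid in subprocess.check_output(["pgrep", "-f", "torch.distributed.run"]).split():
    os.kill(int(pid), signal.SIGTERM)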

It looks like in Python, SIGINT (sent when you press Ctrl+C in the terminal) produces a KeyboardInterrupt, which results in the rendezvous' shutdown() method being called in the finally block. shutdown() should only be called on an orderly shutdown, not a "failure"; hence it closes the rendezvous permanently, failing all other nodes with a RendezvousClosedException.
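Schematically, the control flow is something like this (a simplified sketch with illustrative names, not the actual launcher/agent code):

def launch_agent(rdzv_handler, run_workers):
    try:
        # On Ctrl+C, SIGINT surfaces here as a KeyboardInterrupt.
        return run_workers()
    finally:
        # The finally block cannot tell an orderly exit from an interrupt,
        # so it closes the rendezvous; the surviving nodes then fail with
        # RendezvousClosedException instead of re-forming the group.
        rdzv_handler.shutdown()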

Elastic was built to handle real-life faults, which would deliver a SIGTERM or SIGKILL, or in the worst case cause the node itself to crash and disappear. None of these produce a SIGINT.

Hi, Kiuk!

Yes, I used Ctrl+C to kill the process, and the killed process received a KeyboardInterrupt exception.

I have just tried sending SIGTERM to the main Python process. However, the training still failed, regardless of whether gloo or nccl was used as the dist backend, or c10d or etcd as the rdzv backend.

The related log, which is consistent with the former log except for the kill signal, is attached: Ubuntu Pastebin

Thanks for your time! I am looking forward to the further investigation and your demo video in the GitHub issue!

Ref [distributed elastic] rendezvous brain split with etcd backend · Issue #67616 · pytorch/pytorch · GitHub

@Kiuk_Chung It seems that SIGTERM does not work either; we need to use SIGKILL. Can we remove the code at pytorch/api.py at cd51d2a3ecc8ac579bee910f6bafe41a4c41ca80 · pytorch/pytorch · GitHub to avoid shutting down the rendezvous when the agent receives SIGTERM?

I see, thanks for the investigation. That code was added in: [torchelastic] Improve process termination logic (#61602) · pytorch/pytorch@0c55f1b · GitHub

This only affects worker (not agent) exception handling, so I still don't understand why you suspect this is the reason scale-down is not working when a SIGTERM is sent to the agent.

Oh, I misread the code. The signal handler is registered in the agent process. I have to talk to @aivanou, who added this piece of logic. I believe the termination handler was added to make sure there are no orphaned trainers when the agent gets signaled. Instead of removing the termination handler, we probably need to catch the SignalException in the main loop and avoid the finally block that shuts down the rendezvous.
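Roughly this shape, i.e. skip the rendezvous shutdown when the agent itself was signaled (a sketch of the idea with a simplified signature, not the actual patch):

from torch.distributed.elastic.multiprocessing.api import SignalException

def launch_agent(rdzv_handler, run_workers):
    shutdown_rdzv = True
    try:
        return run_workers()
    except SignalException:
        # The agent itself got SIGTERM/SIGINT: treat it like a node failure
        # rather than an orderly shutdown, so the rendezvous stays open and
        # the remaining nodes can re-rendezvous and scale down.
        shutdown_rdzv = False
        raise
    finally:
        if shutdown_rdzv:
            rdzv_handler.shutdown()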

I can confirm this is indeed a bug. Please track the progress of the fix: [torch/elastic] Scale down does not work correctly when agent is killed with SIGINT, SIGTERM · Issue #67742 · pytorch/pytorch · GitHub. The fix itself is quite simple.

It's solved! Thank you all, @Kiuk_Chung and @gaocegege!


And also @mrshenli! (A new user cannot mention more than two people in a post!)