Training process is terminated when a node fails with torch elastic

Hi!

I have recently been using torch elastic with the c10d backend and min_nodes=1. I have succeeded in joining an existing training job from other nodes dynamically: the training process blocks for rendezvous and restarts from the latest checkpoint with a new remaining iteration count (because of the updated world size), as expected.
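For reference, the restart bookkeeping in my script is roughly the sketch below (resume_state, per_gpu_batch, dataset_len and the "step" key are placeholder names for illustration, not the actual code in main.py):

import torch
import torch.distributed as dist

def resume_state(ckpt_path, per_gpu_batch, dataset_len):
    # Restore the latest checkpoint and recompute the remaining number of
    # iterations for the (possibly changed) world size after re-rendezvous.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    world_size = dist.get_world_size()            # reflects the new node count
    steps_per_epoch = dataset_len // (per_gpu_batch * world_size)
    remaining = steps_per_epoch - ckpt["step"] % steps_per_epoch
    return ckpt, remaining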

However, when I kill the process on the other node, the c10d node (the one hosting the rendezvous) also fails and the training is terminated. The error log with NCCL info is attached below:

ip-10-0-0-204:31012:31048 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
ip-10-0-0-204:31012:31048 [0] NCCL INFO transport/net_socket.cc:405 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO include/net.h:28 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO transport/net.cc:357 -> 2
ip-10-0-0-204:31012:31048 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
Traceback (most recent call last):
  File "./main.py", line 603, in <module>
    main()
  File "./main.py", line 188, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "./main.py", line 471, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: NCCL communicator was aborted on rank 1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31012) of binary: /home/ubuntu/anaconda3/envs/pytorch_1.9_p37/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 4.040053606033325 seconds
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 889, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Stop_waiting response is expected
Exception in thread RendezvousKeepAliveTimer_0:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255, in _run
    ctx.function(*ctx.args, **ctx.kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1002, in _keep_alive_weak
    self._keep_alive()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1012, in _keep_alive
    self._op_executor.run(op, deadline)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 546, in run
    has_set = self._state_holder.sync()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 376, in sync
    get_response = self._backend.get_state()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 63, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 103, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
MemoryError: std::bad_alloc

WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'ip-10-0-0-204.us-west-2.compute.internal_30838_0' has failed to shutdown the rendezvous 'yzs123' due to an error of type RendezvousConnectionError.
/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 31012 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9_p37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
            ./main.py FAILED           
=======================================
Root Cause:
[0]:
  time: 2021-10-31_06:11:31
  rank: 1 (local_rank: 0)
  exitcode: 1 (pid: 31012)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

I suppose this is not the expected behavior. Any help based on this information? I am using PyTorch 1.9.1 with Python 3.7, installed from conda.

The training script is on Ubuntu Pastebin; it comes from the Docker image torchelastic/example:0.2.0 with minor modifications.
Launch script: NCCL_DEBUG=INFO python -m torch.distributed.run --nnodes=1:4 --nproc_per_node=1 --rdzv_id=xxxx --rdzv_backend=c10d --rdzv_endpoint=10.0.0.204:29400 ./main.py --arch resnet18 --epochs 20 --batch-size 32 --dist-backend nccl …/…/data/tiny-imagenet-200
(a copy of the Tiny ImageNet dataset is also included in the image)

Hey @Kiuk_Chung, is this behavior expected?

Hmm, how do you kill the agent? Using Ctrl+C? If so, can you try killing it by sending a SIGTERM instead? This might be related to [distributed elastic] How to tolerate agent failures with etcd rendezvous backend? · Issue #67616 · pytorch/pytorch · GitHub
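For example, from another shell on the node, something along these lines should work (assuming the agent was launched via python -m torch.distributed.run and pgrep is available; this is just one way to find the PID):

import os
import signal
import subprocess

# Locate the elastic agent process and send SIGTERM instead of using Ctrl+C
# (Ctrl+C delivers SIGINT, which is the case discussed below).
for pid in subprocess.check_output(["pgrep", "-f", "torch.distributed.run"]).split():
    os.kill(int(pid), signal.SIGTERM)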

It looks like in Python, SIGINT (sent when you press Ctrl+C in the terminal) produces a KeyboardInterrupt, which results in the rendezvous' shutdown() method being called in the finally block. shutdown() should only be called on an orderly shutdown, not a "failure"; hence it closes the rendezvous permanently, failing all other nodes with a RendezvousClosedException.
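Schematically, the control flow is something like this (a simplified sketch with illustrative names, not the actual launcher/agent code):

def launch_agent(rdzv_handler, run_workers):
    try:
        # On Ctrl+C, SIGINT surfaces here as a KeyboardInterrupt.
        return run_workers()
    finally:
        # The finally block cannot tell an orderly exit from an interrupt,
        # so it closes the rendezvous; the surviving nodes then fail with
        # RendezvousClosedException instead of re-forming the group.
        rdzv_handler.shutdown()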

Elastic was built to handle real-life faults, which would deliver a SIGTERM or SIGKILL, or in the worst case cause the node itself to crash and disappear. None of these produce a SIGINT.

Hi, Kiuk!

Yes, I used Ctrl+C to kill the process, and the killed process received a KeyboardInterrupt exception.

I have just tried sending SIGTERM to the main Python process. However, the training still failed, regardless of whether gloo or nccl was used as the dist backend, or c10d or etcd as the rdzv backend.

The related log, which is consistent with the former log except for the kill signal, is attached: Ubuntu Pastebin

Thanks for your time! I am looking forward to the further investigation and your demo video in the GitHub issue!

Ref [distributed elastic] rendezvous brain split with etcd backend · Issue #67616 · pytorch/pytorch · GitHub

@Kiuk_Chung It seems that SIGTERM does not work either; we need to use SIGKILL. Can we remove the code at pytorch/api.py at cd51d2a3ecc8ac579bee910f6bafe41a4c41ca80 · pytorch/pytorch · GitHub to avoid shutting down the rendezvous when the agent receives SIGTERM?

I see, thanks for the investigation. That code was added in: [torchelastic] Improve process termination logic (#61602) · pytorch/pytorch@0c55f1b · GitHub

This only affects worker (not agent) exception handling, so I still don't understand why you suspect this is the reason scale-down is not working when a SIGTERM is sent to the agent.

Oh, I misread the code. The signal handler is registered in the agent process. I have to talk to @aivanou, who added this piece of logic. I believe the termination handler was added to make sure there are no orphaned trainers when the agent gets signaled. Instead of removing the termination handler, we probably need to catch the SignalException in the main loop and avoid the finally block that shuts down the rendezvous.
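Roughly this shape, i.e. skip the rendezvous shutdown when the agent itself was signaled (a sketch of the idea with a simplified signature, not the actual patch):

from torch.distributed.elastic.multiprocessing.api import SignalException

def launch_agent(rdzv_handler, run_workers):
    shutdown_rdzv = True
    try:
        return run_workers()
    except SignalException:
        # The agent itself got SIGTERM/SIGINT: treat it like a node failure
        # rather than an orderly shutdown, so the rendezvous stays open and
        # the remaining nodes can re-rendezvous and scale down.
        shutdown_rdzv = False
        raise
    finally:
        if shutdown_rdzv:
            rdzv_handler.shutdown()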

I can confirm this is indeed a bug. Please track the progress of the fix: [torch/elastic] Scale down does not work correctly when agent is killed with SIGINT, SIGTERM · Issue #67742 · pytorch/pytorch · GitHub. The fix itself is quite simple.

It's solved! Thank you all, @Kiuk_Chung and @gaocegege!


And also @mrshenli! (A new user cannot mention more than two people in a post!)