I have been running the training script with DDP. The training data is a combination of 10 datasets.

torchrun --nnodes=1 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=xxxx:29400 cat_train.py 

But at around iteration 26000 (out of 530000 training iterations per epoch), the run fails with:

WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'worker00_934678_0' has failed to send a keep-alive heartbeat to the rendezvous '100' due to an error of type RendezvousTimeoutError.
Traceback (most recent call last):
  File "cat_train.py", line 264, in <module>
    train(start_epoch=start_epoch, model=model, metric_fc=metric_fc, optimizer=optimizer, criterion=criterion, scheduler=scheduler)
  File "cat_train.py", line 77, in train
    loss.backward()
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66
, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 936774) is killed by signal: Killed.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 934761 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 934763 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 934762) of binary: /home/user/anaconda3/envs/face/bin/python3.8
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/face/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cat_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-21_15:29:08
  host      : worker00
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 934762)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

A SIGKILL is often sent to a process by the OS when the system is running out of memory.
Could you check, e.g. via dmesg, whether an out-of-memory condition was detected and the OS was forced to kill your process?

Thanks for your advice.

dmesg -T | grep -i "out of memory"
[Fri Jul 21 15:28:44 2023] Out of memory: Killed process 936774 (python3.8) total-vm:73594860kB, anon-rss:44739804kB, file-rss:150052kB, shmem-rss:24kB, UID:1001 pgtables:90132kB oom_score_adj:0

Thanks for confirming! In that case you should check whether you are storing unnecessary data in host RAM and whether you can reduce it.
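
One common source of growing host RAM with multiple DataLoader workers is the loader configuration itself: every worker process holds its own copy of the dataset object plus its prefetched batches. Below is a minimal sketch of the knobs involved; the dataset name, batch size, and worker count are assumptions for illustration, not taken from your script:

from torch.utils.data import DataLoader

# Hypothetical sketch: each worker process keeps its own copy of `dataset`
# plus `prefetch_factor` batches in host RAM, so lowering these values
# directly reduces host memory pressure.
loader = DataLoader(
    dataset,                   # placeholder; your combined 10-dataset object
    batch_size=64,             # assumed value
    num_workers=2,             # try lowering this if workers get OOM-killed
    prefetch_factor=2,         # batches prefetched per worker (2 is the default)
    pin_memory=True,
    persistent_workers=False,  # recreate workers each epoch so their RAM is released
)

Another frequent culprit is accumulating full tensors over thousands of iterations, e.g. appending loss to a list instead of loss.item().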

Thanks. I am training on a single machine with multiple GPUs. How can I check the host RAM usage?

You could track it in another terminal during the training run, e.g. via htop.
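
If you prefer to log it from inside the training script, here is a small sketch using psutil (an extra dependency, pip install psutil; the helper name and the logging interval are made up for illustration):

import os
import psutil

def log_host_ram(tag=""):
    # Print the resident set size (host RAM) of the current process in GiB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[{tag}] host RSS: {rss / 1024**3:.2f} GiB")

# e.g. inside the training loop, every 1000 iterations:
# if it % 1000 == 0:
#     log_host_ram(tag=f"iter {it}")

Note that DataLoader workers are separate processes (the OOM-killed pid in your log was one of them), so also keep an eye on their individual entries in htop.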

I get it!
Thanks for your help.