I'm training cat_train.py with DDP; the training set is built by concatenating 10 separate datasets. I launch with:
torchrun --nnodes=1 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=xxxx:29400 cat_train.py
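For context, the data pipeline in cat_train.py looks roughly like this (a simplified, standalone sketch: FaceDataset, the paths, and the exact loader settings are placeholders for my real code):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class FaceDataset(Dataset):
    """Placeholder for my actual dataset class."""
    def __init__(self, root):
        self.root = root
    def __len__(self):
        return 100_000
    def __getitem__(self, idx):
        # The real code loads and decodes an image from disk here.
        return torch.zeros(3, 112, 112), 0

roots = [f"/data/part_{i}" for i in range(10)]       # the 10 source datasets
train_set = ConcatDataset([FaceDataset(r) for r in roots])
# In the real script, num_replicas/rank come from the torchrun env vars;
# they are hard-coded here only so the sketch runs on its own.
sampler = DistributedSampler(train_set, num_replicas=3, rank=0, shuffle=True)
train_loader = DataLoader(train_set, batch_size=64, sampler=sampler,
                          num_workers=4, pin_memory=True)
```

With settings like these, each of the 3 ranks forks its own loader workers, so the host runs 3 trainer processes plus 12 DataLoader worker processes.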
But after roughly 26,000 iterations (one epoch is about 530,000 iterations), it fails with:
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'worker00_934678_0' has failed to send a keep-alive heartbeat to the rendezvous '100' due to an error of type RendezvousTimeoutError.
Traceback (most recent call last):
File "cat_train.py", line 264, in <module>
train(start_epoch=start_epoch, model=model, metric_fc=metric_fc, optimizer=optimizer, criterion=criterion, scheduler=scheduler)
File "cat_train.py", line 77, in train
loss.backward()
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66
, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 936774) is killed by signal: Killed.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 934761 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 934763 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 934762) of binary: /home/user/anaconda3/envs/face/bin/python3.8
Traceback (most recent call last):
File "/home/user/anaconda3/envs/face/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/anaconda3/envs/face/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cat_train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-21_15:29:08
host : worker00
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 934762)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
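Since the worker died with signal Killed (SIGKILL) rather than raising a Python exception, my guess is that host memory ran out and the kernel OOM killer reaped the worker process; if so, dmesg should contain a line like "Out of memory: Killed process 936774". To confirm it, I plan to log host memory during training. A minimal sketch (assumes psutil is installed; log_every is a made-up knob):

```python
import os
import psutil

def log_host_memory(step, log_every=1000):
    """Print this process's RSS and the host's available RAM
    every `log_every` steps."""
    if step % log_every:
        return
    rss = psutil.Process(os.getpid()).memory_info().rss
    avail = psutil.virtual_memory().available
    print(f"step {step}: rss={rss / 2**30:.2f} GiB, "
          f"host available={avail / 2**30:.2f} GiB", flush=True)
```

If available memory drifts toward zero before the crash, lowering num_workers (each rank spawns its own set of workers) or removing any per-worker caching in the Dataset would be the first things I'd try.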