Getting a "ChildFailedError"?

My code had been running without any problem until now. When I tried running more than one process on a single machine, I got an error saying the port was already in use, and I worked around it by setting a random master port:
CUDA_VISIBLE_DEVICES=1,2 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=$RANDOM

As I’ve already mentioned, I had no problem… until now…

```
                CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 12190 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train

warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
train.py FAILED
Root Cause:
  time: 2021-08-16_10:44:53
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 12190)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
Other Failures:
  time: 2021-08-16_10:44:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 12191)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
```

I am running my code again with that `record` decorator on my train function, as the error message told me to (but should I also add it to my validation function?).

I'm not sure this will fix the error, but I don't see many threads with the same error as mine…


Were you able to run your training code with more than one process before?

The record decorator should be applied to the entrypoint of your script (e.g. def main(args)). It is not meant for fixing things, but for dumping the Python stack trace of your failed subprocess(es) to the log output so you can have more information about the problem.
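As a sketch (the function name `main` and its body are placeholders here, not something from the original post), applying the decorator to the entrypoint looks like this:

```python
# Sketch: decorate the script's top-level entrypoint with `record` so
# that, if a worker process raises, its Python traceback is written to
# an error file the elastic launcher can report (instead of "<N/A>").
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # do train (placeholder body)
    ...


if __name__ == "__main__":
    main()
```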

Right now there is no way for us to suggest anything, since your log output does not contain any meaningful information for root-causing the issue.

Also, besides the record decorator, you can use the new torch.distributed.run script in place of torch.distributed.launch, and set its --log_dir, --redirects, and --tee options to dump the stdout/stderr of your worker processes to a file. You can learn more about our new launcher script here.
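For instance, a sketch of such a launch (the log directory is an assumption, and `train.py` is the script from the traceback above; the command is only printed here, since actually running it needs torch and multiple GPUs):

```shell
# Sketch: the earlier launch switched to torch.distributed.run, with
# per-worker stdout/stderr captured under an assumed ./worker_logs dir.
# Flag spelling (--log_dir) follows the PyTorch 1.9-era argument names.
CMD='CUDA_VISIBLE_DEVICES=1,2 python3 -m torch.distributed.run
  --nproc_per_node=2 --master_port=$RANDOM
  --log_dir ./worker_logs --redirects 3 --tee 3 train.py'
echo "$CMD"
```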


Hi, thanks for your reply!
Yes, I was able to run multiple processes using random ports without any problem…
I don't know what might be causing the error…

Ok. I would still recommend giving torch.distributed.run a try and see what log output you get for worker processes.

Oh right, yes, I will try that.
I ran the processes again without it and, strangely, it works now… so I have no idea what was wrong… :sweat_smile:

I experienced the same issue tonight while training on a Kubernetes cluster. The exit code was 6 for me instead of 1. However, strangely, it was fixed after retrying this morning.