Getting a "ChildFailedError"?

My code had been running without any problem until now. When I tried running more than one process on a single machine, I got an error saying the port was already in use, and I worked around it by setting a random master port:
CUDA_VISIBLE_DEVICES=1,2 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=$RANDOM

As I’ve already mentioned, I had no problem… until now…

```
                CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 12190 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train

warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
train.py FAILED
Root Cause:
  time: 2021-08-16_10:44:53
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 12190)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
Other Failures:
  time: 2021-08-16_10:44:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 12191)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
```

I am running my code again with that `record` decorator on my train function, as the error message told me to (but should I also add it to my validation function?).

I'm not sure this will fix the error, but I don't see many threads with the same error as mine…


Were you able to run your training code with more than one process before?

The record decorator should be applied to the entrypoint of your script (e.g. def main(args)). It is not meant for fixing things, but for dumping the Python stack trace of your failed subprocess(es) to the log output so you can have more information about the problem.
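As a sketch (the function name `main` and its body are placeholders here, not something from the original post), applying the decorator to the entrypoint looks like this:

```python
# Sketch: decorate the script's top-level entrypoint with `record` so
# that, if a worker process raises, its Python traceback is written to
# an error file the elastic launcher can report (instead of "<N/A>").
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # do train (placeholder body)
    ...


if __name__ == "__main__":
    main()
```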

Right now there is no way for us to suggest anything, since your log output does not contain any meaningful information for root-causing the issue.

Also, besides the record decorator, you can use the new torch.distributed.run script in place of torch.distributed.launch, and set its --log_dir, --redirects, and --tee options to dump the stdout/stderr of your worker processes to a file. You can learn more about our new launcher script here.
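For instance, a sketch of such a launch (the log directory is an assumption, and `train.py` is the script from the traceback above; the command is only printed here, since actually running it needs torch and multiple GPUs):

```shell
# Sketch: the earlier launch switched to torch.distributed.run, with
# per-worker stdout/stderr captured under an assumed ./worker_logs dir.
# Flag spelling (--log_dir) follows the PyTorch 1.9-era argument names.
CMD='CUDA_VISIBLE_DEVICES=1,2 python3 -m torch.distributed.run
  --nproc_per_node=2 --master_port=$RANDOM
  --log_dir ./worker_logs --redirects 3 --tee 3 train.py'
echo "$CMD"
```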


Hi, thanks for your reply!
Yes, I was able to run multiple processes using random ports without any problem…
I don't know what might be causing the error…

Ok. I would still recommend giving torch.distributed.run a try and see what log output you get for worker processes.

Oh right, yes, I will try that.
I ran the processes again without it and, strangely, it works now… so I have no idea what was wrong… :sweat_smile:

I experienced the same issue tonight while training on a Kubernetes cluster. The exit code was 6 for me instead of 1. However, strangely, it was fixed after retrying this morning.