Having "ChildFailedError"..?

ooodragon · August 16, 2021, 11:49am

My code had been running without any problems until now.
I had tried running more than one job on a single machine.
That caused an error saying the port was already in use, which I solved by launching with a random port:
CUDA_VISIBLE_DEVICES=1,2 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=$RANDOM
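
A note on the random port: in bash, `$RANDOM` returns a value between 0 and 32767, so it can occasionally land on a privileged port (below 1024) or on one that is already taken. A minimal sketch of an alternative is to ask the operating system for a port that is free right now. The helper name `find_free_port` and the file name `find_free_port.py` are hypothetical, not part of PyTorch.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port on `host` that is unused at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel choose any available port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Print the port so a shell command substitution can pick it up.
    print(find_free_port())
```

Saved as `find_free_port.py`, this could be used as `--master_port=$(python3 find_free_port.py)`. The socket is closed before the launcher binds the port, so another process could still grab it in between; this narrows the collision window but does not remove it.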

As I’ve already mentioned, I had no problem… until now…

**********************************************************************             
               CHILD PROCESS FAILED WITH NO ERROR_FILE                                          
**********************************************************************                         
CHILD PROCESS FAILED WITH NO ERROR_FILE                                                                          
Child process 12190 (local_rank 0) FAILED (exitcode 1)                                          
Error msg: Process failed with exitcode 1                                                                  
Without writing an error file to <N/A>.                                                         
While this DOES NOT affect the correctness of your application,                                
no trace information about the error will be available for inspection.                                           
Consider decorating your top level entrypoint function with                                     
torch.distributed.elastic.multiprocessing.errors.record. Example:                 
                                                                                               
  from torch.distributed.elastic.multiprocessing.errors import record                          
                                                                                                                                                                                                                                                                            
  @record                                                                                                                                                                                                                                                                  
  def trainer_main(args):                                                                                                                                                                                                                                                   
     # do train                                                                                                                                                                                                                                                             
**********************************************************************                                                                                                                                                                                                      
  warnings.warn(_no_error_file_warning_msg(rank, failure))                                                                                                                                                                                                                  
Traceback (most recent call last):                                                                                                                                                                                                                                          
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main                                                                                                                                                                                                
    return _run_code(code, main_globals, None,                                                                                                                                                                                                                              
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code                                                                                                                                                                                                           
    exec(code, run_globals)                                                                                                                                                                                                                                                 
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in <module>                                                                                                                                                                          
    main()                                                                                                                                                                                                                                                                  
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main                                                                                                                                                                              
    run(args)                                                                                                                                                                                                                                                               
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run                                                                                                                                                                                  
    elastic_launch(                                                                                                                                                                                                                                                         
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__                                                                                         
    return launch_agent(self._config, self._entrypoint, list(args))                                                                                                                                                                                                         
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper           
    return f(*args, **kwargs)                                                                                                                
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(                              
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:                
***************************************                                                       
            train.py FAILED                                                                    
=======================================                                                                                                                                                                 
Root Cause:                                                                                                   
[0]:                                                                                                        
  time: 2021-08-16_10:44:53        
  rank: 0 (local_rank: 0)                                                                                     
  exitcode: 1 (pid: 12190)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
  time: 2021-08-16_10:44:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 12191)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
***************************************

I am running my code again with the “record” decorator on my train function, as the error message suggested. (Should I also add it to my validation function…?)

I am not sure if this will fix the error, but I don’t see many threads with the same error as mine…

cbalioglu · August 16, 2021, 1:48pm

Hi,

Were you able to run your training code with more than one process before?

The record decorator should be applied to the entrypoint of your script (e.g. def main(args)). It is not meant for fixing things, but for dumping the Python stack trace of your failed subprocess(es) to the log output so you can have more information about the problem.
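
A minimal sketch of that placement, assuming a script whose `main` calls hypothetical `train` and `validate` functions: only `main` is decorated, because an exception raised anywhere underneath it propagates up through the entrypoint. The `ImportError` fallback is my own addition so the script still starts on installs that lack the elastic module; it is not something PyTorch requires.

```python
import sys

try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption: fall back to a no-op decorator when the elastic module is
    # unavailable. In that case no error file is written.
    def record(fn):
        return fn


def train(args):
    """Placeholder for the real training loop (hypothetical)."""


def validate(args):
    """Placeholder for the real validation loop (hypothetical)."""


@record  # decorate only the top-level entrypoint
def main(args):
    train(args)
    validate(args)
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

With this layout there is no need to decorate `train` or `validate` separately.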

Right now there is no chance for us to suggest anything since your log output does not contain any meaningful information to root cause the issue.

Besides the record decorator, you can also use the new torch.distributed.run script in place of torch.distributed.launch, and set its --log-dir, --redirects, and --tee options to dump the stdout/stderr of your worker processes to a file. You can learn more about our new launcher script here.
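
As a rough sketch, the launch command with those logging options could be assembled like this. The function name, the port, and the `./logs` directory are placeholders. The value `3` for `--redirects` and `--tee` is, as far as I understand the launcher, the setting that covers both stdout and stderr. The underscore spelling `--log_dir` is my assumption of a form the argument parser accepts; check `python -m torch.distributed.run --help` on your version.

```python
import shlex
import sys


def build_launch_command(script, nproc_per_node=2, master_port=29500,
                         log_dir="./logs"):
    """Build an argv list for torch.distributed.run with worker logs captured."""
    return [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_port={master_port}",
        f"--log_dir={log_dir}",  # assumption: underscore spelling is accepted
        "--redirects=3",         # write stdout and stderr of workers to files
        "--tee=3",               # also echo both streams to the console
        script,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_launch_command("train.py")))
```

The per-worker log files under the log directory are where the real Python traceback of a failed rank should show up.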

ooodragon · August 16, 2021, 2:05pm

Hi, thanks for your reply!
Yes, I was able to run multiple processes using random ports without any problem…
I don't know what the error might be…

cbalioglu · August 16, 2021, 6:01pm

Ok. I would still recommend giving torch.distributed.run a try and see what log output you get for worker processes.

ooodragon · August 17, 2021, 1:24am

Oh, right.
Yes, I will try that.
I ran the processes again without that… strangely it works now… so I have no idea what was wrong…

hasansalimkanmaz · March 18, 2022, 9:05am

I experienced the same issue last night while training on a Kubernetes cluster. The exit code was 6 for me instead of 1. However, strangely, it worked again when I retried this morning.

Zhang_Kin · March 20, 2023, 4:58pm

I hit this error today because I was using a cluster that limits RAM (jobs are submitted through Slurm). After I requested more RAM, it worked well.

mohit_dulani · May 10, 2023, 12:57pm

@ooodragon What exactly did you do to overcome this error?
Please share the code if possible!
@cbalioglu Please help.

pvtien96 · August 5, 2023, 8:19pm

I got a similar error but with exitcode: -11

[2023-08-05 22:18:08 internimage_t_1k_224](main.py 307): INFO Start training
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 33003) of binary: /home/van-tien.pham/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/van-tien.pham/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/van-tien.pham/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
main.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-05_22:18:15
  host      : siim-3.lis-lab.fr
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 33003)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 33003
=======================================================
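
For reading the exit codes reported in this thread: a negative exit code means the worker was terminated by a signal with that number, while a positive one is the status the process itself exited with (an uncaught Python exception gives 1). A minimal sketch of that mapping using only the standard library follows; `describe_exitcode` is a hypothetical helper name, and the signal names it prints depend on the platform.

```python
import signal


def describe_exitcode(exitcode):
    """Turn a worker exit code into a short human-readable description."""
    if exitcode == 0:
        return "exited normally"
    if exitcode < 0:
        # Negative values encode the number of the signal that killed the process.
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    for code in (1, 6, -11):
        print(code, "->", describe_exitcode(code))
```

On Linux, -11 is reported as SIGSEGV, which matches the traceback line in the log above. A segmentation fault usually originates in native code rather than in the Python script itself.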

djaym7 · September 22, 2023, 6:56pm

Same -11 error. Any solution?

Wassim_Jaoui · March 19, 2024, 3:34am

I had exitcode 1.
I found out that I was running my code in the wrong environment; I had set everything up in Anaconda before.
I activated the environment:
conda activate nameEnvironment
and installed PyTorch with GPU support using:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
instead of:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
That resolved the problem for me.
Get the suitable installation command here:
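
Separately from the install command, a quick way to check for the wrong-environment situation described in this post is to print which interpreter is running and where `torch` would be imported from. This is only a sketch using the standard library; `describe_environment` is a hypothetical helper name.

```python
import importlib.util
import sys


def describe_environment():
    """Report the running interpreter and the location of the torch package."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("torch")
    return {
        "python": sys.executable,
        "prefix": sys.prefix,
        "torch_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `python` or `torch_location` points outside the conda environment you activated, the launcher is picking up a different install than the one you set up.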