Debugging an error from torch.distributed.run

Hi, I am training my model on multiple GPUs within a single node:

python -u -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu train.py

Unlike single-GPU training, error messages are not displayed properly during multi-GPU training; I only get a ChildFailedError. If I train on a single GPU without DDP, the specific cause of the error is printed as usual (dimension mismatch, out of memory, etc.).
Is it expected that a multi-GPU run does not report the specific error?

Below is the error message displayed when using multiple GPUs. By rerunning on a single GPU, I found that the cause was running out of memory.

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:root:entering barrier
0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429248 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429250 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429251 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429252 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429253 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429254 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 429255 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 429249) of binary: /home/dngusdnr1/anaconda3/envs/pytorch3d/bin/python
Traceback (most recent call last):
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dngusdnr1/anaconda3/envs/pytorch3d/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-02-07_17:26:52
  host      : nova004
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 429249)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@aivanou @Kiuk_Chung ?

Take a look at the message towards the end of the paste that says "To enable traceback see: https://pytorch.org/docs/stable/elastic…". You have to annotate your main function with @record so that trainer errors propagate to the launcher process.
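For reference, a minimal sketch of what an annotated train.py could look like. The `train` function here is a hypothetical placeholder for the real training loop; only the `record` import and decorator come from PyTorch.

```python
from torch.distributed.elastic.multiprocessing.errors import record


def train():
    # Hypothetical placeholder: the real training loop (model, DDP setup, etc.) goes here.
    pass


@record
def main():
    # Any uncaught exception raised inside main() is recorded by @record
    # and then re-raised, so the process still exits with a failure.
    train()


if __name__ == "__main__":
    main()
```

Launched with the same command as before (python -u -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu train.py), the "Root Cause" section of the summary should then include the trainer's traceback instead of only the exit code.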

When you are running DDP, the trainers are created as child processes of the launcher. When these child processes fail with an error (exception), by default the parent process (launcher) only knows the exit code and not the full trace info (you can't "try-catch" exceptions between processes). The @record annotation essentially wraps your trainer's main function with a try-catch that writes the trace info into a file. The launcher then looks at these trace files and writes an error summary for each rank (if a trace file was written).
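As a rough illustration of that mechanism (this is not PyTorch's actual implementation), a decorator along these lines captures the idea using only the standard library. The names `record_sketch` and `SKETCH_ERROR_FILE` are made up for this sketch; the real decorator is the one imported from torch.distributed.elastic.multiprocessing.errors.

```python
import functools
import json
import os
import traceback

# Made-up environment variable name for this sketch: the parent process would
# set it to a per-rank file path before starting each child.
ERROR_FILE_ENV = "SKETCH_ERROR_FILE"


def record_sketch(fn):
    """Write the traceback of an uncaught exception to a file, then re-raise."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            path = os.environ.get(ERROR_FILE_ENV)
            if path:
                # The parent cannot catch this exception across the process
                # boundary, so leave the details where it can read them later.
                with open(path, "w") as f:
                    json.dump(
                        {
                            "message": repr(exc),
                            "traceback": traceback.format_exc(),
                        },
                        f,
                    )
            # Re-raise so the child still exits with a non-zero exit code.
            raise

    return wrapper


@record_sketch
def main():
    raise RuntimeError("simulated out-of-memory error")
```

Calling main() with SKETCH_ERROR_FILE set to a writable path leaves a JSON file containing the message and traceback, which a parent process could read after the child exits.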


@Hyeonuk_Woo can you please give an example of a train.py that does not produce any errors?

I tried just raising an exception and got the following output (posted on Pastebin.com): "*****************************************Setting OMP_NUM_THREADS environment v"

There you can see the root exception.

Also, can you try running it with:

LOGLEVEL=INFO python -u -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu train.py ?


Have you solved this problem? I encountered the same situation.