PyTorch distributed: Running shell command

I’m running a distributed PyTorch training job. Everything works like a charm: all GPUs are fully utilized, all processes are in sync, everything is fine.
At the end of each epoch, I want to run some elaborate evaluation in a new process (so as not to block the training):

import subprocess

if args.rank == 0:
  # only for the "main" rank
  subprocess.run(['python3', 'my_eval_code.py', '--chk', 'checkpoint'])

At this point, execution stops, the new process is not started and everything just halts.

  1. Is there some interdependence between PyTorch’s DDP and the subprocess module?
  2. How can I start a new shell script (subprocess.run / subprocess.call / subprocess.Popen) from inside a DDP process? A sketch of the non-blocking launch I have in mind follows below.
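
For reference, this is a minimal sketch of the fire-and-forget launch I am trying to achieve (same my_eval_code.py as above; using start_new_session=True to detach the child is my own assumption, not something I have verified changes anything):

import subprocess

if args.rank == 0:
  # fire-and-forget: launch the eval and do not wait for it to finish
  # start_new_session=True puts the child in its own session, so it is not
  # tied to the trainer's controlling terminal or signal handling
  proc = subprocess.Popen(
      ['python3', 'my_eval_code.py', '--chk', 'checkpoint'],
      stdout=subprocess.DEVNULL,
      stderr=subprocess.DEVNULL,
      start_new_session=True,
  )
  print(f'launched eval as PID {proc.pid}')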

I also posted this question on SO.


Update (July 29th, 2021)
I changed my code to:

import subprocess

# cmd is the same eval command as above: ['python3', 'my_eval_code.py', '--chk', 'checkpoint']
proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
print(f'\t{proc}={proc.poll()}')
try:
  proc_o, proc_e = proc.communicate(timeout=120)
  print(f'successfully communicated o={proc_o} e={proc_e} poll={proc.poll()}')
except subprocess.TimeoutExpired:
  proc.kill()
  proc_o, proc_e = proc.communicate()
  print(f'time out o={proc_o} e={proc_e} poll={proc.poll()}')

No good: the Popen call blocks; the print of poll() is never executed, let alone the communicate.
When I check on the job with top, I see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
37924 bagon     39  19   23640   2796    880 S  15.8  0.1   0:15.34 python3

Looking at the process that is actually running, I see this:

UID        PID  PPID  C STIME TTY      STAT   TIME CMD
bagon    37924 37065  1 08:00 ?        SNl    0:15 /home/bagon/.conda/envs/my_env/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=50, pipe_handle=54) --multiprocessing-fork

It seems like there is some underlying mechanism preventing the subprocess module from starting new processes.

Any help?

Thanks for posting @shaibagon. For your questions:
1. Yeah, I think there is some connection, because PyTorch uses multiprocessing and it has some interdependencies with subprocess.
2. If you don’t want to block training, did you try whether subprocess.Popen works? Could you also try multiprocessing.Process and see if that works? A rough sketch is below.
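
Something along these lines is what I had in mind for the multiprocessing.Process route (run_eval here is a placeholder for whatever my_eval_code.py does, wrapped as a function):

import multiprocessing as mp

def run_eval(checkpoint_path):
    # placeholder: the evaluation logic from my_eval_code.py, as a function
    ...

if args.rank == 0:
    # run evaluation in a separate process so training is not blocked;
    # the 'spawn' start method avoids forking a process that holds CUDA state
    ctx = mp.get_context('spawn')
    eval_proc = ctx.Process(target=run_eval, args=('checkpoint',))
    eval_proc.start()
    # optionally, join at the very end of training:
    # eval_proc.join()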

I tried subprocess.Popen as well; I did not see any process start, and it is unclear what exactly the problem is: I don’t see any errors/feedback, only that the execution halts.

I opened a bug report describing this issue, along with a minimal script that reproduces this deadlock.

Fingers crossed.

I answered your issue on GitHub: torch.distributed and subprocess do not work together? · Issue #62381 · pytorch/pytorch · GitHub. Hopefully that clarifies how to use init_process_group. But I suspect this may not address the original issue brought up in this post? It is hard to say why my_eval_code.py is blocking and what exactly that code is doing. If you can find a locally reproducible example where it blocks, that would help! Something along the lines of the sketch below would be a good starting point.
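
A minimal sketch of what I mean, assuming a single node, the gloo backend, and my_eval_code.py standing in for your real evaluation script:

import os
import subprocess

import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # ... a few dummy training steps would go here ...

    if rank == 0:
        # the call that reportedly hangs; my_eval_code.py is a stand-in script
        out = subprocess.run(['python3', 'my_eval_code.py', '--chk', 'checkpoint'],
                             capture_output=True)
        print(f'rank 0: eval finished with returncode={out.returncode}')

    dist.barrier()
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)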