I’m running distributed PyTorch training. Everything works like a charm: all GPUs are fully utilized, all processes are in sync, everything is fine.
At the end of each epoch, I want to run some elaborate evaluation in a new process (so as not to block the training):
import subprocess

if args.rank == 0:
    # only for the "main" rank
    subprocess.run(['python3', 'my_eval_code.py', '--chk', 'checkpoint'])
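Note that even if this call worked, `subprocess.run` waits for the child to exit, so it would still pause the training loop. A non-blocking sketch of what I’m after, using `Popen` instead (fire-and-forget; reaping the child is assumed to happen elsewhere):

```python
import subprocess

if args.rank == 0:
    # Popen returns immediately; training continues while the child runs
    eval_proc = subprocess.Popen(
        ['python3', 'my_eval_code.py', '--chk', 'checkpoint'])
    # reap the child later, e.g. at the end of the next epoch:
    # eval_proc.wait()
```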
At this point execution stops: the new process is never started and everything just halts.
- Is there some interdependence between PyTorch’s DDP and the `subprocess` module?
- How can I start a new shell script (`subprocess.run`/`subprocess.call`/`subprocess.Popen`) from inside a DDP process (see the sketch below)?
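For context, the pattern I have in mind keeps the other ranks at a barrier while rank 0 launches the child, so no rank races ahead (a minimal sketch, assuming the default process group is already initialized; `launch_eval` is just an illustrative name):

```python
import subprocess
import torch.distributed as dist

def launch_eval(rank: int) -> None:
    if rank == 0:
        # only rank 0 spawns the evaluation script; Popen does not wait for it
        subprocess.Popen(['python3', 'my_eval_code.py', '--chk', 'checkpoint'])
    # keep all ranks in sync before the next epoch starts
    dist.barrier()
```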
I also posted this question on SO.
Update (July 29th, 2021)
I changed my code to:
proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
print(f'\t{proc}={proc.poll()}')
try:
    # wait up to two minutes for the child to finish
    proc_o, proc_e = proc.communicate(timeout=120)
    print(f'successfully communicated o={proc_o} e={proc_e} poll={proc.poll()}')
except subprocess.TimeoutExpired:
    # the child did not finish in time: kill it and collect its output
    proc.kill()
    proc_o, proc_e = proc.communicate()
    print(f'time out o={proc_o} e={proc_e} poll={proc.poll()}')
No good: the `Popen` call itself blocks, the print of the `poll` result is never executed, let alone the `communicate`.
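In case the pipes themselves are part of the problem, one variant I plan to try redirects the child’s output to files and detaches it into its own session (a sketch; the log file names are placeholders):

```python
import subprocess

cmd = ['python3', 'my_eval_code.py', '--chk', 'checkpoint']
with open('eval_stdout.log', 'wb') as out, open('eval_stderr.log', 'wb') as err:
    # start_new_session=True detaches the child from our process group
    proc = subprocess.Popen(cmd, stdout=out, stderr=err, start_new_session=True)
print(f'launched eval as pid={proc.pid}')
```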
When I check on the job with `top`, I see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
37924 bagon 39 19 23640 2796 880 S 15.8 0.1 0:15.34 python3
Looking at the process that actually runs, I see this:
UID PID PPID C STIME TTY STAT TIME CMD
bagon 37924 37065 1 08:00 ? SNl 0:15 /home/bagon/.conda/envs/my_env/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=50, pipe_handle=54) --multiprocessing-fork
It seems like there is some underlying mechanism preventing the `subprocess` module from starting new processes.
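To narrow this down further, a quick diagnostic is to print, right before the `Popen` call, which process is actually executing it and which multiprocessing start method is in effect (a sketch):

```python
import multiprocessing
import os

# confirm which process attempts the launch and how children are created
print(f'pid={os.getpid()} ppid={os.getppid()} '
      f'start_method={multiprocessing.get_start_method()}')
```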
Any help?