Distributed Data Parallel example - "process 0 terminated with exit code 1"

I’m running the Distributed Data Parallel example in JupyterLab and getting an error:

process 1 terminated with exit code 1

How can I fix it? Where should I look? I tried using “nccl” and “mpi” as the backend in dist.init_process_group, with no effect. Replacing the entire body of example() with pass also had no effect.

Full stack trace:

Exception                                 Traceback (most recent call last)
<ipython-input-29-703537cd0120> in <module>
     33         join=True)
---> 35 main()

<ipython-input-29-703537cd0120> in main()
     28 def main():
     29     world_size = 2
---> 30     mp.spawn(example,
     31         args=(world_size,),
     32         nprocs=world_size,

/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    197                ' torch.multiprocessing.start_process(...)' % start_method)
    198         warnings.warn(msg)
--> 199     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156     # Loop on join until it returns True or raises an exception.
--> 157     while not context.join():
    158         pass

/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    108                 )
    109             else:
--> 110                 raise Exception(
    111                     "process %d terminated with exit code %d" %
    112                     (error_index, exitcode)

Exception: process 1 terminated with exit code 1

Thanks for posting, @dyukha. I see that you are running the code inside a Jupyter notebook. I think the error you hit is not a problem with PyTorch distributed; it is related to an incompatibility between Python's multiprocessing module and Jupyter Notebook, because the multiprocessing module pickles the data (including the target function) that it sends to the spawned processes. If you try the script in a normal Python file, it should work fine.
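For anyone who wants to stay inside the notebook, one commonly used workaround is to move the worker function into a real .py file and import it, so that the spawned processes can find it by module name. Below is a minimal sketch of that idea using only the standard library multiprocessing module as a stand-in for torch.multiprocessing. The module name ddp_worker, the helpers load_worker and launch, and the placeholder body of example are all assumptions made for illustration, not part of the original code.

```python
import importlib
import multiprocessing as mp
import sys
import textwrap
from pathlib import Path

# Source of the worker module. The body of example() is a placeholder;
# the real DDP setup (init_process_group, model, etc.) would go here.
WORKER_SOURCE = textwrap.dedent('''
    def example(rank, world_size):
        print(f"worker {rank} of {world_size} started")
''')


def load_worker(directory=".", name="ddp_worker"):
    """Write the worker to <directory>/<name>.py and import it as a module.

    "ddp_worker" is a hypothetical name chosen for this sketch.
    """
    directory = Path(directory).resolve()
    (directory / f"{name}.py").write_text(WORKER_SOURCE)
    if str(directory) not in sys.path:
        sys.path.insert(0, str(directory))
    importlib.invalidate_caches()
    return importlib.import_module(name)


def launch(fn, world_size):
    """Run fn(rank, world_size) in world_size 'spawn' processes; return exit codes."""
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=fn, args=(rank, world_size))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

In a notebook cell this would be used as worker = load_worker() followed by launch(worker.example, world_size=2). Because example now lives in an importable module rather than in the notebook's __main__, it can be pickled by reference and looked up again in the child processes. The same imported function could presumably be passed to mp.spawn from torch.multiprocessing in place of the hypothetical launch helper, though I have not verified that against the original example.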


Thanks for the reply! Is there a way to make it work in Jupyter?

Is this still an issue?

You could try multiprocessing.Pool and see if that works; see “Multiprocessing on Python 3 Jupyter” on Stack Overflow.