I’m running the Distributed Data Parallel example in JupyterLab and getting an error:
process 1 terminated with exit code 1
How can I fix it, and where should I look? I tried passing “nccl” or “mpi” as the backend to dist.init_process_group, with no effect. Replacing the entire body of example() with pass also had no effect.
Full stack trace:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-29-703537cd0120> in <module>
33 join=True)
34
---> 35 main()
<ipython-input-29-703537cd0120> in main()
28 def main():
29 world_size = 2
---> 30 mp.spawn(example,
31 args=(world_size,),
32 nprocs=world_size,
/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
197 ' torch.multiprocessing.start_process(...)' % start_method)
198 warnings.warn(msg)
--> 199 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
155
156 # Loop on join until it returns True or raises an exception.
--> 157 while not context.join():
158 pass
159
/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
108 )
109 else:
--> 110 raise Exception(
111 "process %d terminated with exit code %d" %
112 (error_index, exitcode)
Exception: process 1 terminated with exit code 1
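
One thing that may be worth checking (an assumption, not a confirmed diagnosis): the trace shows start_method='spawn', and a spawned child has to re-import the target function by module and name. A function defined in a notebook cell lives in a `__main__` that has no backing file, so the child may fail before example() ever runs, which would fit the observation that changing its body to pass made no difference. The helper below is a hypothetical, stdlib-only heuristic (the name `defined_in_importable_module` is made up for illustration and is not part of PyTorch):

```python
import sys


def defined_in_importable_module(fn):
    """Heuristic: could a freshly spawned interpreter re-import `fn`?

    The "spawn" start method pickles the target by module and qualified
    name, so the child process must be able to import that module again.
    This is only a rough check, not what multiprocessing itself runs.
    """
    module = sys.modules.get(fn.__module__)
    if module is None:
        return False
    # Lambdas and nested functions cannot be pickled by reference.
    if fn.__name__ == "<lambda>" or "<locals>" in fn.__qualname__:
        return False
    if fn.__module__ == "__main__":
        # In a notebook, __main__ has no __file__ for the child to re-run.
        return getattr(module, "__file__", None) is not None
    return True
```

If `defined_in_importable_module(example)` returns False inside the notebook, a common workaround is to move example() into a .py file next to the notebook and import it before calling mp.spawn. The child's own traceback, if any, is usually printed to the terminal or log of the Jupyter server rather than to the cell output.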