I am working with pytorch-lightning in an effort to bring objects back to the master process when using DistributedDataParallel. Lightning launches these sub-processes with torch.multiprocessing.spawn().
I ran into some issues, and decided to build a tiny model to try things out. Unfortunately I cannot seem to share a SimpleQueue when using torch.multiprocessing.spawn().
I am on Ubuntu 18.04, python 3.6.8, torch 1.4.
Here is the code:
import torch.multiprocessing as mp

def f(i, q):
    print(f"in f(): {q} {q.empty()}")
    print(f"{q.get()}")

if __name__ == '__main__':
    q = mp.SimpleQueue()
    q.put(['hello'])
    p = mp.spawn(f, (q,))
    print(f"main {q.empty()} {q.get()}")
This results in:
in f(): <multiprocessing.queues.SimpleQueue object at 0x7f97b7b71eb8> False
Traceback (most recent call last):
File "test.py", line 36, in <module>
p = mp.spawn(f, (q,))
File "/home/seth/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/seth/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV
The SimpleQueue object appears to be valid inside the child - a dir() call shows all the right attributes - but the only method that seems to work is empty(). I haven't tried every method, but the basic ones (get(), put()) all result in SIGSEGV. Ouch.
If I do not use mp.spawn() but instead use a normal process start and join, it works fine, as expected. More curiously, it also works fine if I mimic what mp.spawn() does myself - minus PyTorch's SpawnContext version of join:
import multiprocessing as mp

def f(q):
    print(f"in f(): {q} {q.empty()}")
    print(f"{q.get()}")

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    q = ctx.SimpleQueue()
    q.put(['hello'])
    p = ctx.Process(target=f, args=(q,))
    p.start()
    p.join()
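For completeness, this is a sketch of the "normal process start and join" version mentioned above - the same repro using the stock multiprocessing module with its default start method (fork on Linux), which is the variant that works without any spawn context at all:

```python
import multiprocessing as mp

def f(q):
    # With the fork start method the child inherits the queue directly,
    # so both empty() and get() work as expected.
    print(f"in f(): {q} {q.empty()}")
    print(f"{q.get()}")

if __name__ == '__main__':
    q = mp.SimpleQueue()
    q.put(['hello'])
    p = mp.Process(target=f, args=(q,))
    p.start()
    p.join()
```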
Is this a bug - or am I doing something wrong?