```python
import torch as tc
import torch.multiprocessing as mp

def do_something():
    b = tc.randn(1000, 1000)
    print(f'A tensor is created in process {mp.current_process().name}')
    print(b.sum())  # this line hangs forever

def bug():
    # a = tc.randn(1000, 1000)
    a = tc.ones(1000, 1000)  # commenting this line out and reviving the line above makes the bug go away
    p = mp.Process(target=do_something, args=())
    p.start()
    p.join()

if __name__ == '__main__':
    bug()
```
My torch version is 1.10.2.
It is crazy that creating a tensor in the parent process causes a deadlock in the child process, even though the tensor is never passed to the child. It is even crazier that if I replace the tensor in the parent process with another one, the deadlock goes away!
Is it generally not recommended to use forked subprocesses in PyTorch?
Thanks.
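One workaround often suggested for fork-related hangs like this (a sketch, not verified against your exact setup) is to start the child with the `'spawn'` method, so it gets a fresh interpreter instead of inheriting the parent's thread and lock state:

```python
import torch
import torch.multiprocessing as mp

def do_something():
    b = torch.randn(1000, 1000)
    print(f'A tensor is created in process {mp.current_process().name}')
    print(b.sum())  # should not hang when the child is spawned rather than forked

if __name__ == '__main__':
    # a spawned child starts a fresh interpreter, so it does not inherit
    # the parent's OpenMP/threading state, which is what fork can corrupt
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=do_something)
    p.start()
    p.join()
```

`torch.multiprocessing` is a drop-in wrapper around the standard `multiprocessing` module, so `get_context('spawn')` works the same way with either.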
I ran your code and did not see any bugs; it completed smoothly. My torch is 1.10.0.
I'm having a similar problem with Python 3.10 and torch 2.0/2.2 (didn't try 2.1). Here is my test code:
```python
import torch
import multiprocessing
import numpy as np

def test(n):
    x = np.zeros((n, n))
    print('tensor')
    x = torch.tensor(x)  # the subprocess gets stuck here
    print('done')
    return x

def main():
    tensor = torch.zeros(200, 1000)
    # tensor = torch.zeros(200, 100)  # this works; the bug seems to go away with a small tensor
    # neither torch.multiprocessing nor the built-in multiprocessing works
    # pool = torch.multiprocessing.Pool(1)
    pool = multiprocessing.Pool(1)
    pool.apply(test, (1000,))

if __name__ == '__main__':
    main()
```
Did you figure out a solution, or have any clues?
Thanks.
edit: the bug seems to happen on Linux but goes away on Windows
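That Linux/Windows difference lines up with Python's default start methods: `multiprocessing` forks on Linux (through Python 3.13) but spawns on Windows and macOS, and only a forked child inherits the parent's threading state. A quick way to check what you're getting:

```python
import multiprocessing

# 'fork' on Linux (Python <= 3.13), 'spawn' on Windows and macOS
print(multiprocessing.get_start_method())
```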
edit2: found the cause here