Multiprocessing code works using numpy but deadlocks using pytorch

I’m hitting what appears to be a deadlock when trying to use multiprocessing with pytorch. The equivalent numpy code works as I expect it to.

I’ve made a simplified version of my code: a pool of 4 workers executing an array-wide broadcast operation 10,000 times (so roughly 2,500 iterations per worker). The array in question is 100,000 x 3, and the broadcast operation is subtraction of all rows by a single 1 x 3 row array. The large array is shared by virtue of being a global, and the row array is different at each iteration.
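For concreteness, the per-iteration operation is a single broadcast subtraction (a minimal shape illustration using the same names as the full code below):

import numpy as np

minuend = np.random.rand(100000, 3)   # the shared/global array
subtrahend = np.random.rand(3)        # one row, different at each iteration
result = minuend - subtrahend         # row is broadcast across all 100,000 rows -> shape (100000, 3)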

The code works exactly as expected using numpy, with the pooled workers showing roughly a 4x speedup over the equivalent for loop.

The code in pytorch, however, hits a deadlock (I assume): none of the workers complete the array broadcast operation even once.

The numpy code below prints the following:

Finished for loop over my_subtractor: took 8.1504 seconds.
Finished pool over my_subtractor: took 2.2247 seconds.

The pytorch code, on the other hand, prints this and then stalls:

Finished for loop over my_subtractor: took 3.1082 seconds.
BLA
BLA
BLA
BLA

The “BLA” print statements are just there to show that each worker gets stuck in what appears to be a deadlocked state. There are exactly 4 of them: one per worker entering, and getting stuck inside, a single iteration.

If you feel ambitious enough to reproduce this, note that the script won’t run as-is on Windows: the pool launch isn’t wrapped in an if __name__ == '__main__': guard, which Windows requires because it starts worker processes by re-importing the main module. You will also need to create an empty file called my_globals.py alongside the script.
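For reference, here is a rough sketch of how the numpy version could be laid out for Windows. This is only illustrative, not the code I’m actually running; the main guard plus a pool initializer (the init_worker helper is my own addition) is one way to get the shared global into each spawned worker, since spawned processes don’t inherit the parent’s globals:

# my_globals.py stays an empty module; it is only used as a shared namespace.

from multiprocessing import Pool
import numpy as np
import my_globals

def init_worker(minuend):
    # spawn-based workers re-import this file, so the global has to be set here
    my_globals.minuend = minuend

def my_subtractor(subtrahend):
    my_globals.minuend - subtrahend
    return 0

if __name__ == '__main__':
    minuend = np.random.rand(100000, 3)
    subtrahends = np.random.rand(10000, 3)
    pool = Pool(4, initializer=init_worker, initargs=(minuend,))
    pool.map(my_subtractor, subtrahends)
    pool.close()
    pool.join()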

Here is the numpy code:

from time import time
import numpy as np
import my_globals
from multiprocessing import Pool as ThreadPool

# shared memory by virtue of being global
my_globals.minuend = np.random.rand(100000,3)

# array to be iterated over in for loop / pool of workers
subtrahends = np.random.rand(10000,3)

# function called at each iteration (broadcast operation)
def my_subtractor(subtrahend):
    my_globals.minuend - subtrahend
    return 0

# launch for loop
ts = time()
for idx, subtrahend in enumerate(subtrahends):
    my_subtractor(subtrahend)
te = time()
print('Finished for loop over my_subtractor: took %2.4f seconds.' % (te - ts))

# launch equivalent pool of workers
ts = time()
pool = ThreadPool(4)
pool.map(my_subtractor, subtrahends)
pool.close()
pool.join()
te = time()
print('Finished pool over my_subtractor: took %2.4f seconds.' % (te - ts))

Here is the equivalent pytorch code:

from time import time
import torch
import my_globals

from torch.multiprocessing import Pool as ThreadPool

# necessary on my system because it has low limits for number of file descriptors; not recommended for most systems,
# see: https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor
torch.multiprocessing.set_sharing_strategy('file_system')

# shared memory by virtue of being global
my_globals.minuend = torch.rand(100000,3)

# array to be iterated over in for loop / pool of workers
subtrahends = torch.rand(10000,3)

# function called at each iteration (broadcast operation)
def my_subtractor(subtrahend, verbose=True):
    if verbose:
        print("BLA") # -- prints for every worker in the pool (so 4 times total)
    my_globals.minuend - subtrahend
    if verbose:
        print("ALB") # -- doesn't print for any worker
    return 0

# launch for loop
ts = time()
for idx, subtrahend in enumerate(subtrahends):
    my_subtractor(subtrahend, verbose=False)
te = time()
print('Finished for loop over my_subtractor: took %2.4f seconds.' % (te - ts))

# launch equivalent pool of workers
ts = time()
pool = ThreadPool(4)
pool.map(my_subtractor, subtrahends)
pool.close()
pool.join()
te = time()
print('Finished pool over my_subtractor: took %2.4f seconds.' % (te - ts))