Numpy arrays and torch.multiprocessing

(Patrick Young) #1

I’ve been reading up on pytorch and had my mind blown by the shared memory stuff via queues with torch.Tensor and torch.multiprocessing.

In general, I’ve done a lot of numpy array processing using Python’s multiprocessing module, but the pickling of the arrays is not ideal. I’d assume that the same tricks that pytorch is using for Tensors could be carried over to pure numpy arrays? If not, what is it that stands in the way?



This comment indicates that the default pickler is the reason, and I’m not sure if it’s that easy to change the pickling in numpy.

(Patrick Young) #3

Thanks for finding that! I guess it wouldn’t be too hard to use a torch.Tensor as a container and share the array that way between processes.


What kind of numpy operations are you using? Maybe you could implement them in PyTorch?

EDIT: Also, this should be quite cheap, as numpy arrays and torch.Tensors can share the same underlying data (e.g. via torch.from_numpy).
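To illustrate the zero-copy round trip mentioned above: `torch.from_numpy` and `Tensor.numpy()` both return views of the same buffer, so mutations propagate in either direction.

```python
import numpy as np
import torch

a = np.zeros(3, dtype=np.float32)
t = torch.from_numpy(a)  # zero-copy: t is a view of a's buffer
t += 1                   # mutating the tensor...
print(a)                 # ...mutates the array too: [1. 1. 1.]

b = t.numpy()            # .numpy() is also zero-copy
b[0] = 5
print(a[0])              # 5.0 -- all three objects share one buffer
```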

(Patrick Young) #5

It’s image processing with really large (tens of GBs) images. The pattern is usually to have one reader throwing chunks of the image onto a queue, a bunch of workers cranking through them and placing the results on a writer queue, and then a writer writing out the results.

The workers are often calling libraries that don’t release the GIL, so you’re stuck with multiprocessing.


You want to use a sharedctypes Array. This places the data in shared memory so every process can access the same buffer without copying, and you can choose whether or not to use a lock around it. Here is the documentation:

(Patrick Young) #7

Thanks dgriff, was not aware of those!


Np :wink:

You might find this link very helpful: