Multiprocessing shared memory

andrei-rusu · December 16, 2021, 11:12am

Hi all,

Due to a known memory limitation that causes errors on Windows when torch is imported by multiple spawned processes (see the Stack Overflow question "How to efficiently run multiple Pytorch Processes / Models at once ? Traceback: The paging file is too small for this operation to complete"), I would like to avoid any unnecessary memory usage when training my distributed model. I therefore switched to torch.multiprocessing in the hope that at least the model parameters would be shared among processes.

Now, I would like to understand which Tensors are actually shared among child mp.Processes, and how. It seems to me that the data within Tensors remains consistent across processes when they are passed via ‘args’ (I haven’t used mp.Queue in my tests yet), even when I simply import the default ‘multiprocessing’ package, and even when I omit the ‘share_memory_()’ calls altogether. This makes me suspect that the memory is not actually shared but copied around a lot, much like when using a Manager.dict().

Another ‘strange’ thing I observed concerns ‘is_shared()’. For a Tensor passed through ‘args’, it returns False in the parent before the mp.Process is started when no call to ‘share_memory_()’ has been made, but it always returns True inside the child processes that receive the Tensor (again, even if torch.multiprocessing is never imported). What is more, despite my best efforts at inspecting ‘data_ptr’ and the backing ‘storage’ objects, I was not able to confirm whether my Tensor data is ever truly shared.
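
One check that does not rely on ‘is_shared()’ or ‘data_ptr’ is a write test: modify the data in the child and read it back in the parent. The sketch below uses only the standard library (a multiprocessing Array next to a plain list), so it illustrates the test itself and makes no claim about how torch moves Tensors between processes. The names ‘child_write’ and ‘write_test’ are made up for this example.

```python
import multiprocessing as mp


def child_write(shared_buf, plain_list):
    # Runs in the child: overwrite the first element of both containers.
    shared_buf[0] = 42.0
    plain_list[0] = 42.0


def write_test(start_method=None):
    # start_method=None uses the platform default ("spawn" on Windows).
    ctx = mp.get_context(start_method)
    shared_buf = ctx.Array("d", [0.0, 0.0, 0.0])  # backed by shared memory
    plain_list = [0.0, 0.0, 0.0]                  # ordinary Python object
    proc = ctx.Process(target=child_write, args=(shared_buf, plain_list))
    proc.start()
    proc.join()
    # Only a truly shared buffer shows the child's write in the parent.
    return shared_buf[0], plain_list[0]


if __name__ == "__main__":
    print(write_test())
```

The shared Array reflects the child's write, while the plain list keeps its original value in the parent. Assuming a Tensor is substituted for both containers, the same pattern (write in the child, read in the parent after ‘join()’) would show whether its data is shared or copied.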

These observations lead to my final question: how can I check whether the parameters of my model actually live in true shared memory?
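
As a starting point, a minimal parent-side sketch could report ‘is_shared()’ for every parameter before and after calling ‘share_memory()’ on the module. The ‘nn.Linear’ model is only a stand-in for the real one, and ‘shared_report’ is a hypothetical helper name. Note that this only reads the flag each Tensor reports; on its own it does not prove that two processes see the same bytes.

```python
import torch.nn as nn


def shared_report(model):
    # Map each parameter name to whether its storage reports shared memory.
    return {name: p.is_shared() for name, p in model.named_parameters()}


model = nn.Linear(4, 2)        # stand-in for the real model (assumption)
before = shared_report(model)  # freshly created CPU parameters
model.share_memory()           # applies share_memory_() to parameters and buffers
after = shared_report(model)

print("before:", before)
print("after: ", after)
```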

Thanks for your time!