Sharing model between processes automatically allocates new memory

I’ve encountered a problem when sharing a model between processes, and it is critical for me (in terms of memory resources).

I’ve been sharing a model between several processes (on Linux, Ubuntu). The model is used only for forward passes, since it performs some pre-processing on the samples (before they are fed to a different network). I’ve done everything I can to ensure that: the model is in eval mode, every parameter has requires_grad set to False, and the forward pass runs under `with torch.no_grad():`.
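For reference, the freezing steps described above can be sketched like this (the model here is a hypothetical stand-in, not my actual network):

```python
import torch
import torch.nn as nn

def freeze_for_inference(model: nn.Module) -> nn.Module:
    """Put a model into pure-inference mode: eval() plus requires_grad=False."""
    model.eval()  # disables dropout, makes batchnorm use running stats
    for p in model.parameters():
        p.requires_grad_(False)  # no gradients tracked for the weights
    return model

# Wrap the forward pass in no_grad so no autograd graph is built at all
model = freeze_for_inference(nn.Linear(4, 2))
with torch.no_grad():
    y = model(torch.randn(1, 4))
```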

The problem is that after the new process is spawned, for some reason it allocates new memory on the GPU. At first I thought this memory held intermediate values of the computational graph, but then I noticed that each process allocates new GPU memory even when it just sleeps (i.e. before any data is run through the model). Furthermore, it is a lot of memory relative to the model! The model is about 4 GB (let’s say 2 GB of weights and 2 GB of optimizer state), and the extra memory allocated is 1 GB (!), which may also indicate that the network is not completely replicated, only a part of it.
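One way to narrow this down is to compare what PyTorch’s caching allocator reports against what nvidia-smi shows; a rough probe (it degrades gracefully when no CUDA device is present):

```python
import torch

def report_gpu_memory(tag: str) -> None:
    # memory_allocated() counts only tensor storage managed by PyTorch's
    # caching allocator; the CUDA context and kernel images are invisible
    # to it, which is why nvidia-smi can show much more than this number.
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 2**20
        reserved = torch.cuda.memory_reserved() / 2**20
        print(f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB")
    else:
        print(f"{tag}: no CUDA device available")

report_gpu_memory("after process start")
```

If the allocator reports ~0 bytes while nvidia-smi shows ~1 GB for the child process, the extra memory is not tensors at all.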

Here is example code; I think it contains the most critical parts of what I’m doing:

import torch

def inferrerFunc(neuralNetwork):
	# Even if we just sleep() here, the GPU memory is still allocated
	# Imagine there's a dataset here...
	for x in dataset:
		y_t = neuralNetwork(x)

class mainProc():
	def __init__(self):
		self.neuralNetwork = neuralNetwork()  # the pre-processing model
		torch.multiprocessing.set_start_method('spawn', force=True)

	def startInferrer(self):
		self.inferrer = torch.multiprocessing.Process(
			target=inferrerFunc, args=(self.neuralNetwork,))
		self.inferrer.start()


When you pass self.neuralNetwork as an argument, I believe it is pickled in the parent and unpickled in the child process, and the unpickling is most likely what re-allocates memory. Note that share_memory() only applies to CPU tensors, not GPU tensors. The pickling happens with the 'spawn' start method; you could try 'fork' to see if that resolves the issue.

It turns out that every time a process holds any PyTorch object that is allocated on the GPU, it creates its own CUDA context, containing an individual copy of all the kernels (CUDA functions) that PyTorch uses, which is about 1 GB.
It seems there is no way around it, and if your machine has X GB of GPU RAM, then you’re limited to roughly X processes. The only workaround is dedicating one process to hold the PyTorch model and have it interact with the other processes in a producer-consumer pattern, which is a real headache when it comes to scalability, and even more so for real-time applications.

Since this seems like a memory limitation imposed by PyTorch, feel free to file a GitHub issue. It would be valuable to have a repro where extra memory is allocated unexpectedly.

It is a known issue, and as I understand it, fixing it would require a massive change, so it’s not even on the agenda.