Multiprocessing CUDA memory

Hi !

I’m currently using multiprocessing in a project, and I was wondering whether there is a way to avoid reinitializing CUDA in every process (which takes roughly 300 MB of VRAM from what I saw).
I send models to the processes and don’t expect to get anything PyTorch-related back.
I already tried using Queues, share_memory and sending the state_dict, but none of these worked.
From what I have read so far, it seems to be unavoidable. Is this true? And if not, what would be a workaround?

Thanks, and sorry for the question.


Are you passing CUDA tensors? Did you wrap the CUDA initialization code in if __name__ == '__main__':?

I managed to reproduce it.
Any idea on what I’m doing wrong / not understanding ?

Thanks !

Oh you are indeed using cuda in each process so I don’t think that is avoidable.

@dylandjian Why are you using the same GPU from multiple processes? Typically GPUs will run faster if they are just doing one thing at a time, and in any case, AFAIK, will only run a single kernel at a time. Is the multiprocessing because there is a substantial CPU component to deal with, e.g. some kind of preprocessing on each batch or similar?

@hughperkins Well, I’m trying to train an agent with an evolutionary algorithm (CMA-ES in my case). In order to evaluate the fitness of the potential solution for the parameters of the model (a simple linear combination or 2 linear layers max), I have to run the environment which is CPU bound. So to do the evaluation “efficiently”, I spawn many processes to evaluate multiple solutions at a time.
PyTorch is probably overkill to run such simple models, but it was easier for me to implement !
Also, I have 2 relatively “big” models (VAE, LSTM) that I have to pass to the child process, but shared memory is completely ok for these 2.

Can you use a multiple-actor, single-learner model, something like IMPALA, so that the actors run in separate processes to maximize CPU usage, while the learner runs in a single process that uses the GPU?

I think that is already what I’m doing.
In the gist that I linked above, I just pass the agent to an evaluator function (the Test class) that runs in another process, from the main learner process that is responsible for exploiting the reward sent back from the evaluator. The evaluator function only uses the agent to make forward passes and gather the final reward at the end. I use a torch.no_grad in my evaluator function to make sure that no gradients are being calculated !
Is this what you mean ?

Ah, possibly. It sounds like the evaluator is pretty simple, and could run on the cpu, with little/no performance hit? (might even be faster, since no need to move data back and forth to the gpu?).

The evaluator could run on CPU but with a big performance hit because model 1 is a big Convolutional VAE, and model 2 is a big LSTM, so inference time gets pretty long on CPU since it also has to emulate the environment !
Is it required to re-initialize CUDA entirely in every new process, even if we just want to run inference on memory that was already allocated by a process that has itself already initialized CUDA ?
If so, any tips to reduce the memory footprint of the initialization (and in general perhaps) ?
I am currently at 414MiB per new process.
Thanks !

I guess there are three possible ways of answering this:

  1. Is it possible to avoid initialization? Giving a definitive answer is possibly outside my own expertise. Simon Wang states that it’s not, and that matches my own experience.

  2. Is it theoretically possible to reduce the footprint of initializing CUDA, and is it practically possible/realistic for PyTorch to do so? The first point is again possibly outside my own expertise; you could compile and run a simple C++ CUDA app that simply initializes CUDA. There’s one at , which you can probably compile using nvcc or similar. I guess that the intrinsic overhead is low. As for whether it is practically possible/realistic for PyTorch to do so: definitely outside of my expertise. I would imagine you’d need to make a compelling case that you’d exhausted all available options, and that other people might encounter a similar issue.

  3. As for ‘exhausting all available options’, I guess this is the bit I’m mostly addressing. Even if you somehow initialize CUDA with minimal overhead from e.g. 64 different processes, if they’re all running against the same GPU, the GPU part is basically going to be sequential. In that case, could you have a single central process that communicates via shared memory etc. with the actor processes, and handles the GPU processing for all of them? I imagine this would run at the same speed as your ideal easy-to-program scenario, whilst avoiding the initialization issue.
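
To make the third option a bit more concrete, here is a rough sketch of the "single central GPU process" idea, using only the standard library so it runs anywhere. Everything in it is an assumption for illustration, not code from this thread: the names gpu_server, actor, run_demo and default_infer are hypothetical, and default_infer is a placeholder for the real forward pass (in a PyTorch setup, that is where the VAE/LSTM would sit on the GPU under torch.no_grad()). The point is only that the actors never touch CUDA, so the context is created once, in the server.

```python
import multiprocessing as mp
import queue

STOP = None  # sentinel that tells the server to shut down


def default_infer(batch):
    # Placeholder for the real forward pass (hypothetical). With PyTorch,
    # this is where the inputs would be moved to the GPU and the model run.
    # Here each "action" is simply the sum of the observation.
    return [sum(obs) for obs in batch]


def gpu_server(request_q, reply_qs, infer=default_infer, max_batch=8):
    """Serve inference requests; the only process that should use CUDA."""
    running = True
    while running:
        first = request_q.get()  # block until at least one request arrives
        if first is STOP:
            break
        pending = [first]
        # Batch whatever else is already waiting, up to max_batch requests.
        while len(pending) < max_batch:
            try:
                item = request_q.get_nowait()
            except queue.Empty:
                break
            if item is STOP:
                running = False
                break
            pending.append(item)
        worker_ids, batch = zip(*pending)
        for wid, out in zip(worker_ids, infer(list(batch))):
            reply_qs[wid].put(out)  # route each result back to its actor


def actor(worker_id, request_q, reply_q, result_q, steps=3):
    """CPU-bound environment loop; never initializes CUDA itself."""
    total = 0.0
    obs = [float(worker_id), 1.0]
    for _ in range(steps):
        request_q.put((worker_id, obs))  # ask the server for an action
        action = reply_q.get()
        total += action                  # stand-in for the environment reward
        obs = [action, 1.0]              # stand-in for the next observation
    result_q.put((worker_id, total))


def run_demo(n_actors=4):
    # "spawn" because the CUDA runtime does not support the fork start method.
    ctx = mp.get_context("spawn")
    request_q = ctx.Queue()
    reply_qs = [ctx.Queue() for _ in range(n_actors)]
    result_q = ctx.Queue()
    server = ctx.Process(target=gpu_server, args=(request_q, reply_qs))
    actors = [
        ctx.Process(target=actor, args=(i, request_q, reply_qs[i], result_q))
        for i in range(n_actors)
    ]
    server.start()
    for p in actors:
        p.start()
    results = dict(result_q.get() for _ in range(n_actors))
    for p in actors:
        p.join()
    request_q.put(STOP)
    server.join()
    return results
```

To try it, call run_demo() from under an if __name__ == '__main__': guard in a script. Whether this beats one CUDA context per process will depend on how long the forward pass takes relative to the queue round trip, so it is worth measuring on the real models.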


Hi, I know this is 2.5 years old, but did you ever find a solution for this? I believe I am doing the exact same thing as you: evolving a CNN and training each model in its own process with the multiprocessing package. CUDA is initialized for each one, which drastically reduces performance. Any tips that you can remember?