CUDA context size per process

seliad · May 6, 2020, 11:33am

Hi,
I noticed that a large (~500 MB) CUDA context is created per process on the GPU.
You can see it simply by running:

import torch
torch.randn(1,device=0)

It takes about 500 MB (it used to take 750 MB in previous versions).
When multiple processes share the same GPU, this is a lot of unneeded memory.
How can we work around this?
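
One way to check the per-process figure from outside the process is to ask nvidia-smi for memory use by PID. The following is a minimal sketch, assuming nvidia-smi is on the PATH and accepts the `--query-compute-apps=pid,used_memory` query; the helper names `parse_compute_apps` and `gpu_memory_by_pid` are made up for illustration. The reported number includes the context plus whatever the process has allocated, so for a one-element tensor it only approximates the context size.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows (MiB, no header, no units) into {pid: MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip malformed rows and values such as "[N/A]"
        pid, mib = int(parts[0]), int(parts[1])
        # A PID that uses several GPUs appears once per GPU, so sum the rows.
        usage[pid] = usage.get(pid, 0) + mib
    return usage


def gpu_memory_by_pid():
    """Return {pid: MiB} for compute processes, or {} if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for pid, mib in sorted(gpu_memory_by_pid().items()):
        print(f"pid {pid}: {mib} MiB")
```

Running this in a second terminal while the snippet above is sitting in a Python prompt should list the Python process with its GPU memory footprint.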

mrshenli · May 6, 2020, 2:08pm

Do multiple processes have to work on the same set of GPUs? Can each process work on an exclusive set of GPUs and use CUDA_VISIBLE_DEVICES to control which devices they see?
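
As a rough sketch of that setup, each worker can be launched with its own CUDA_VISIBLE_DEVICES value. The helper names (`env_for_worker`, `launch_workers`) and the assignment layout are assumptions for illustration only. The variable has to be set before CUDA is initialized in the child, and inside each child the visible devices are renumbered starting from 0.

```python
import os
import subprocess
import sys


def env_for_worker(worker_index, assignments, base_env=None):
    """Copy the environment and restrict it to this worker's GPUs.

    assignments is a list with one list of physical GPU ids per worker,
    e.g. [[0], [1, 2]] gives worker 0 GPU 0 and worker 1 GPUs 1 and 2.
    """
    env = dict(os.environ if base_env is None else base_env)
    gpu_ids = assignments[worker_index]
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


def launch_workers(assignments, argv):
    """Start one subprocess per worker, each seeing only its own GPUs."""
    return [
        subprocess.Popen(argv, env=env_for_worker(i, assignments))
        for i in range(len(assignments))
    ]


if __name__ == "__main__":
    # Each child just prints the device list it was given.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    for proc in launch_workers([[0], [1]], child):
        proc.wait()
```

With exclusive assignments like this, each GPU carries only one process's CUDA context.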

seliad · May 6, 2020, 2:30pm

I explicitly want multiple processes using the same GPU.

mrshenli · May 6, 2020, 2:41pm

I am not aware of a way to avoid the per-process CUDA context or reduce its size. @ptrblck and @albanD might know more.

ptrblck · May 7, 2020, 12:20am

I don’t think it’s possible to reuse a single CUDA context between processes, but haven’t looked deeply into it.
We expect the best performance using a single process per GPU.
What is your use case @seliad that you want to use multiple processes on the same device?
Are you seeing any performance gains (regardless of the wasted memory)?

seliad · May 7, 2020, 6:00am

For processes A and B, each has to do:
(1) distributed communication (e.g. A->C, B->D)
(2) share parameters (A<->B)

For the distributed communication I need different ranks (I'm currently using CUDA-aware MPI).

Even if there is an option to use distributed communication with threads (I think there is in MPI, but I'm not sure whether PyTorch supports it), doing that in Python is a pretty bad idea.

seliad · May 13, 2020, 2:41pm

I also noticed that when using 2 processes communicating through a Queue,
the sender process (e.g. sending from device 0 to device 1 with copy_) holds this CUDA context on both devices. I created the buffer at the sender and sent it through a queue (reusing it), as recommended in the docs
(push communication model).

Another option is to have a thread in the receiver waiting on that queue and pulling from it. I guess that won’t cause this extra memory?

So I wonder whether it could be (theoretically, and very partially) solved by creating all tensors in a single process (a single owner) and sending them to all processes sharing the device.