Multiprocessing CUDA tensor proliferation on 'cuda:0'

Hi,

My program shares GPU models/tensors held in a class across multiple processes. Below is a minimal working example (MWE):

import time
import torch
import multiprocessing as mp
import torch.multiprocessing as tmp


def work(rank, t):
    # Each worker receives the Tester instance, including its CUDA tensor.
    print("Working")
    time.sleep(10)
    return


class Tester(object):
    def __init__(self, num_proc):
        self.device = 'cuda:1'
        self.tensor = torch.zeros(1).to(self.device)
        self.np = num_proc

    def run(self):
        # Spawn the workers, passing self (and hence the CUDA tensor) to each one.
        procs = [mp.Process(target=work, args=(rank, self)) for rank in range(self.np)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # tmp.spawn(work, nprocs=self.np, args=(self,))


def main():
    t = Tester(4)
    t.run()


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    main()

Essentially, the processes share a tensor that lives on ‘cuda:1’. On starting the program, gpustat looks like the following:

[gpustat output at startup; screenshot not included]

However, just before exit, gpustat shows the following memory use:

[gpustat output just before exit; screenshot not included]

Now, in my actual implementation, where multiple processes share models on each of the GPUs in the system, this leads to ‘cuda:0’ running out of memory and to cuDNN errors.
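
To see which device each worker actually touches, work() in the MWE above can be extended with a couple of prints. This is only a sketch, meant as a drop-in replacement for the work() above (it reuses the MWE's imports), and note that torch.cuda.current_device() itself triggers CUDA initialization in the child:

def work(rank, t):
    # The shared tensor arrives on the device it was created on ('cuda:1'),
    # while the child's default CUDA device is still index 0.
    print(f"rank {rank}: tensor on {t.tensor.device}, "
          f"default device index {torch.cuda.current_device()}")
    time.sleep(10)
    return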

I checked whether the torch.multiprocessing wrapper, used instead of Python's multiprocessing, has the same issue; it does.

What exactly is going wrong here?

Thank you.

Furthermore, it seems that the memory is mostly overhead rather than actual data; notice the size occupied (> 2100 MB) for just a torch.zeros(1)!

Moreover, I notice that when the device the processes run on is set to ‘cuda:0’, ‘cuda:1’ ends up with about 12 MB of memory allocated.

What is exactly happening here?

So you mean that, on exit, we initialize CUDA on device 0?

Note that one way to avoid this is to use CUDA_VISIBLE_DEVICES=1 to make sure you never use other GPUs.

Thanks for the response.
However, my requirement is not to run the code on just one of the GPUs; I need to run it on all available GPUs.

But this env variable can be set on a per-process basis. From what I understand above, a process should only use one GPU, no?

The problem you’re seeing is that a worker that should run only on GPU 1 actually uses GPU 0, right?
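
In case it helps, here is a sketch of one way to apply the per-process idea: set CUDA_VISIBLE_DEVICES in the parent right before starting each worker (with the spawn start method the child should inherit the environment as it is at start() time), and create the CUDA tensors inside the worker rather than passing them in. The round-robin GPU assignment and the num_gpus value below are illustrative, not from the original code:

import os
import time
import torch
import multiprocessing as mp


def work(rank, gpu_id):
    # Only one physical GPU is visible in this process, so it shows up as 'cuda:0'.
    tensor = torch.zeros(1, device='cuda:0')
    print(f"rank {rank}: physical GPU {gpu_id}, tensor on {tensor.device}")
    time.sleep(10)


def main():
    num_proc = 4
    num_gpus = 2  # adjust to the machine; assumed here for illustration
    procs = []
    for rank in range(num_proc):
        gpu_id = rank % num_gpus
        # Restrict visibility before the child is launched; the child never
        # initializes the GPUs it cannot see.
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
        p = mp.Process(target=work, args=(rank, gpu_id))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    main()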

Perfect! Thanks for this idea. It worked.

However, the other issue still persists: a torch.zeros(1), or for that matter any small tensor, takes up roughly 2000 MB of memory. Any idea why that is happening?

Thanks.

Yes, that memory is used by the CUDA driver and runtime. Just doing a CUDA init eats up all that memory :’(
This is why we try to only initialize CUDA on the GPUs that are actually used.
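
The overhead can be observed in isolation with something like the snippet below. This is only a sketch: the exact figure depends on the GPU, driver, and PyTorch build, and torch.cuda.mem_get_info requires a reasonably recent PyTorch release:

import time
import torch

# Creating even a single-element CUDA tensor forces CUDA initialization for this
# process; the context plus the kernels PyTorch loads account for the bulk of the
# memory that gpustat / nvidia-smi reports for the process.
x = torch.zeros(1, device='cuda:0')

free_b, total_b = torch.cuda.mem_get_info(0)  # device-wide free/total, in bytes
print(f"free: {free_b / 2**20:.0f} MiB / total: {total_b / 2**20:.0f} MiB")

time.sleep(60)  # keep the process alive so the usage can be checked externally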

Thanks for this info.
So, in practice, I should set aside that much memory from my overall budget?

Yes, you need to set aside this much memory for each process.
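
As a back-of-the-envelope check (the ~2100 MB figure comes from the observation above; the other numbers are purely illustrative):

# Rough per-GPU budget, using the ~2100 MB context overhead observed above.
context_overhead_mb = 2100        # CUDA context + kernels, per process
per_process_workload_mb = 1500    # illustrative: model + activations per process
procs_per_gpu = 4                 # as in the example above

required_mb = procs_per_gpu * (context_overhead_mb + per_process_workload_mb)
print(required_mb)  # 14400 MB in this example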