Multiprocessing CUDA tensor proliferation on 'cuda:0'

Hi,

My program shares GPU models/tensors held in a class across multiple processes. Below is a minimal working example (MWE):

import time
import torch
import multiprocessing as mp
import torch.multiprocessing as tmp


def work(rank, t):
    # Each worker receives the Tester instance, including its CUDA tensor.
    print("Working")
    time.sleep(10)
    return


class Tester(object):
    def __init__(self, num_proc):
        self.device = 'cuda:1'
        self.tensor = torch.zeros(1).to(self.device)
        self.np = num_proc

    def run(self):
        # Spawn the workers, passing self (and hence the CUDA tensor) to each one.
        procs = [mp.Process(target=work, args=(rank, self)) for rank in range(self.np)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # tmp.spawn(work, nprocs=self.np, args=(self,))


def main():
    t = Tester(4)
    t.run()


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    main()

Essentially, the processes share a tensor that lives on ‘cuda:1’. On starting the program, gpustat looks like the following:

[gpustat output at startup; screenshot not included]

However, just before exit, gpustat shows the following memory use:

[gpustat output just before exit; screenshot not included]

Now, in my actual implementation, where multiple processes share models on each of the GPUs in the system, this leads to ‘cuda:0’ running out of memory and to cuDNN errors.
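
To see which device each worker actually touches, work() in the MWE above can be extended with a couple of prints. This is only a sketch, meant as a drop-in replacement for the work() above (it reuses the MWE's imports), and note that torch.cuda.current_device() itself triggers CUDA initialization in the child:

def work(rank, t):
    # The shared tensor arrives on the device it was created on ('cuda:1'),
    # while the child's default CUDA device is still index 0.
    print(f"rank {rank}: tensor on {t.tensor.device}, "
          f"default device index {torch.cuda.current_device()}")
    time.sleep(10)
    return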

I checked whether the torch.multiprocessing wrapper, used instead of Python's multiprocessing, has the same issue; it does.

What exactly is going wrong here?

Thank you.

Furthermore, it seems that the memory is mostly overhead rather than actual data; notice the size occupied (> 2100 MB) for just a torch.zeros(1)!

Moreover, I notice that when the device the processes run on is set to ‘cuda:0’, ‘cuda:1’ ends up with about 12 MB of memory allocated.

What is exactly happening here?

So you mean that, on exit, we initialize CUDA on device 0?

Note that one way to avoid this is to use CUDA_VISIBLE_DEVICES=1 to make sure you never use other GPUs.

Thanks for the response.
However, my requirement is not to run the code on just one of the GPUs; I need to run it on all available GPUs.

But this env variable can be set on a per-process basis. From what I understand above, a process should only use one GPU, no?

The problem you’re seeing is that a worker that should run only on GPU 1 actually uses GPU 0, right?
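
In case it helps, here is a sketch of one way to apply the per-process idea: set CUDA_VISIBLE_DEVICES in the parent right before starting each worker (with the spawn start method the child should inherit the environment as it is at start() time), and create the CUDA tensors inside the worker rather than passing them in. The round-robin GPU assignment and the num_gpus value below are illustrative, not from the original code:

import os
import time
import torch
import multiprocessing as mp


def work(rank, gpu_id):
    # Only one physical GPU is visible in this process, so it shows up as 'cuda:0'.
    tensor = torch.zeros(1, device='cuda:0')
    print(f"rank {rank}: physical GPU {gpu_id}, tensor on {tensor.device}")
    time.sleep(10)


def main():
    num_proc = 4
    num_gpus = 2  # adjust to the machine; assumed here for illustration
    procs = []
    for rank in range(num_proc):
        gpu_id = rank % num_gpus
        # Restrict visibility before the child is launched; the child never
        # initializes the GPUs it cannot see.
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
        p = mp.Process(target=work, args=(rank, gpu_id))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    main()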

Perfect! Thanks for this idea. It worked.

However, the other issue still persists: a torch.zeros(1), or for that matter any small tensor, takes up roughly 2000 MB of memory. Any idea why that is happening?

Thanks.

Yes, that memory is used by the CUDA driver and runtime. Just doing a CUDA init eats up all that memory :’(
This is why we try to only initialize CUDA on the GPUs that are actually used.
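
The overhead can be observed in isolation with something like the snippet below. This is only a sketch: the exact figure depends on the GPU, driver, and PyTorch build, and torch.cuda.mem_get_info requires a reasonably recent PyTorch release:

import time
import torch

# Creating even a single-element CUDA tensor forces CUDA initialization for this
# process; the context plus the kernels PyTorch loads account for the bulk of the
# memory that gpustat / nvidia-smi reports for the process.
x = torch.zeros(1, device='cuda:0')

free_b, total_b = torch.cuda.mem_get_info(0)  # device-wide free/total, in bytes
print(f"free: {free_b / 2**20:.0f} MiB / total: {total_b / 2**20:.0f} MiB")

time.sleep(60)  # keep the process alive so the usage can be checked externally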

Thanks for this info.
So, in practice, I should set aside that much memory from my overall budget?

Yes, you need to set aside this much memory for each process.
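
As a back-of-the-envelope check (the ~2100 MB figure comes from the observation above; the other numbers are purely illustrative):

# Rough per-GPU budget, using the ~2100 MB context overhead observed above.
context_overhead_mb = 2100        # CUDA context + kernels, per process
per_process_workload_mb = 1500    # illustrative: model + activations per process
procs_per_gpu = 4                 # as in the example above

required_mb = procs_per_gpu * (context_overhead_mb + per_process_workload_mb)
print(required_mb)  # 14400 MB in this example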