My program shares GPU models/tensors held in a class across multiple processes. Below is an MWE:
```python
import time
import torch
import multiprocessing as mp
import torch.multiprocessing as tmp


def work(rank, t):
    print("Working")
    time.sleep(10)
    return


class Tester(object):
    def __init__(self, num_proc):
        self.device = 'cuda:1'
        self.tensor = torch.zeros(1).to(self.device)
        self.np = num_proc

    def run(self):
        procs = [mp.Process(target=work, args=(rank, self,)) for rank in range(self.np)]
        [p.start() for p in procs]
        [p.join() for p in procs]
        # tmp.spawn(work, nprocs=self.np, args=(self,))


def main():
    t = Tester(4)
    t.run()


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    main()
```
Essentially, the processes share a tensor that lives on 'cuda:1'. On starting the program, gpustat looks like the following:

[gpustat output omitted: each spawned process shows up holding roughly 2000 MB of GPU memory]
Now, in my actual implementation, where multiple processes share models on each of the GPUs in the system, this leads to out-of-memory errors on 'cuda:0' and some cuDNN errors.
I checked whether the torch.multiprocessing wrapper, instead of plain Python multiprocessing, has the same issue; it does.
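For reference, that check looks roughly like the commented-out tmp.spawn line in the MWE above; a minimal sketch (torch.multiprocessing.spawn passes the process index as the first argument, so the work signature stays the same):

```python
import time
import torch
import torch.multiprocessing as tmp


def work(rank, t):
    # spawn() calls work(i, *args) for each process index i
    print("Working")
    time.sleep(10)


class Tester(object):
    def __init__(self, num_proc):
        self.device = 'cuda:1'
        self.tensor = torch.zeros(1).to(self.device)
        self.np = num_proc

    def run(self):
        # same per-process GPU memory footprint as the mp.Process version
        tmp.spawn(work, nprocs=self.np, args=(self,))


if __name__ == '__main__':
    Tester(4).run()
```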
However, the other issue still persists: even a torch.zeros(1), or for that matter any small tensor, takes up roughly 2000 MB of GPU memory. Any idea why that is happening?
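One way to see that the tensor allocation itself is not where the memory goes (a minimal, hypothetical check; the ~2000 MB figure comes from the gpustat observation above, not from this snippet):

```python
import torch

device = 'cuda:1'
t = torch.zeros(1, device=device)

# Memory that PyTorch itself has allocated/reserved for tensors on this device:
# on the order of bytes to a couple of MB, nowhere near 2000 MB.
print(torch.cuda.memory_allocated(device))  # e.g. 512 bytes
print(torch.cuda.memory_reserved(device))   # e.g. ~2 MB (one caching-allocator block)

# Whatever gpustat/nvidia-smi reports for the process beyond this is per-process
# overhead that PyTorch's caching allocator does not track.
```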
Yes, that memory is used by the CUDA driver and runtime. Just doing a CUDA init eats up all that memory :'(
This is why we try to initialize only on the GPUs that are actually used.
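A minimal sketch of one common way to keep that per-process context off GPUs a worker does not need (an illustration, not code from this thread: the env-var approach and the work/gpu_id names are assumptions): restrict each child process to a single device via CUDA_VISIBLE_DEVICES before it touches CUDA, and move data to the GPU inside the child instead of passing CUDA tensors from the parent.

```python
import os
import time
import torch
import multiprocessing as mp


def work(rank, gpu_id):
    # Must happen before the first CUDA call in this process: the CUDA context is
    # created lazily on first use, so only the visible device pays the ~2 GB cost.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    tensor = torch.zeros(1, device='cuda:0')  # 'cuda:0' now maps to physical GPU gpu_id
    print(f"worker {rank}: {tensor.device} maps to physical GPU {gpu_id}")
    time.sleep(10)


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    # All workers use physical GPU 1, mirroring the MWE; no context is created on GPU 0.
    procs = [mp.Process(target=work, args=(rank, 1)) for rank in range(4)]
    [p.start() for p in procs]
    [p.join() for p in procs]
```

Each worker still pays the context cost on the GPU it actually uses; the point is only that idle GPUs stay untouched.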