PyTorch Distributed is running out of CPU RAM

I have two separate models in my algorithm: a large model that resides on the CPU and a small model that goes to the GPU. I am using DDP to train the small model on multiple GPUs while the large model stays on the CPU. I have 4 GPUs and I observe that the CPU model is also loaded 4 times, which causes an OOM on the CPU. Is there any way to keep a single CPU model alongside the multiple GPU models in DDP?

Do you mean you want to share the same CPU model across 4 DDP processes? If so, you can use torch.multiprocessing.Queue to share the model. Does the CPU model participate in training or just inference?
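
As a related sketch (not the Queue approach, and not code from this thread): one common way to keep a single CPU copy is to move the model's parameters into shared memory with nn.Module.share_memory() and hand the same module to every worker process. The model and sizes below are placeholders.

import torch
import torch.multiprocessing as mp

def worker(rank, cpu_model):
    # Every worker sees the same underlying parameter storage, not a copy.
    with torch.no_grad():
        out = cpu_model(torch.randn(2, 1024))
    print(f"rank {rank}: output sum = {out.sum().item():.4f}")

if __name__ == "__main__":
    cpu_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    cpu_model.share_memory()  # move parameters and buffers into shared memory

    procs = [mp.Process(target=worker, args=(rank, cpu_model)) for rank in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()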

Sorry for the late reply, I missed your comment. Yes, I want to use DDP but keep a single copy of the model on the CPU. My model runs its forward and backward passes on the GPU and the optimizer step on the CPU.

Yes, I want to use DDP but keep a single copy of the model on the CPU.

In this case, you might need to use inter-process communication to share the tensors.

Sorry that I still don’t fully understand the use case. Some pseudo code would be helpful.

Thank you. I am trying to load the CPU model in one thread and broadcast it to the GPUs. However, I have an issue with the barrier. Here is my code:

if args.local_rank in [-1, 0]:
    for net1, net2 in zip(self.encoder.layer[0].named_parameters(),
                          model_cpu.bert.encoder.layer[i].named_parameters()):
        net1[1].data.copy_(net2[1].data.clone(), non_blocking=True)

torch.distributed.barrier()

if args.local_rank not in [-1,0]:
    for name, p in self.encoder.layer[0].named_parameters():
        torch.distributed.broadcast(p, src=0)

My code gets stuck at the barrier. Any idea what could be wrong with it? Should I even be using a barrier?

The default process group is a per-process object, not per-thread. Is “thread” just a typo, and do you actually mean “process”?

if args.local_rank not in [-1,0]:
    for name, p in self.encoder.layer[0].named_parameters():
        torch.distributed.broadcast(p, src=0)

Collective communications require all ranks to make the same number of c10d API calls in the same order. It seems that, with the above code, rank 0 is not participating in the broadcast. If you need to broadcast within a subgroup, you will need to first create the subgroup with the new_group API and then call broadcast on that group.
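
For concreteness, here is a minimal sketch of that fix, reusing the names from your snippet (and using layer 0 on both sides just for illustration):

# Rank 0 copies the CPU weights into its GPU copy of the layer.
if args.local_rank in [-1, 0]:
    with torch.no_grad():
        for (_, p_gpu), (_, p_cpu) in zip(
                self.encoder.layer[0].named_parameters(),
                model_cpu.bert.encoder.layer[0].named_parameters()):
            p_gpu.copy_(p_cpu)

# Every rank in the process group issues the same broadcast calls in the
# same order: rank 0 sends its freshly copied values, the others receive.
if args.local_rank != -1:
    for _, p in self.encoder.layer[0].named_parameters():
        torch.distributed.broadcast(p.data, src=0)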

Sorry, yes, I meant process. I was thinking of loading the model on the CPU in one process, transferring it to GPU 0 (which is in the same process), and then copying the weights from GPU 0 to the other GPUs in the other processes. Is broadcasting the correct way to do this?

I was thinking of loading the model on the CPU in one process, transferring it to GPU 0 (which is in the same process), and then copying the weights from GPU 0 to the other GPUs in the other processes. Is broadcasting the correct way to do this?

Yes, this looks correct to me. DistributedDataParallel actually does something similar in its constructor: it broadcasts the module's parameters and buffers from rank 0 to all other ranks.
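
For anyone landing here later, a minimal self-contained sketch of this flow, not the exact code from this thread (the models are placeholders, and it assumes one GPU per process with the NCCL backend): rank 0 copies the CPU weights into its GPU model, every rank broadcasts, then the GPU model is wrapped in DDP.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Small per-process GPU model (placeholder).
    gpu_model = torch.nn.Linear(1024, 1024).cuda(rank)

    if rank == 0:
        # Large CPU model (placeholder) lives only in the rank 0 process.
        cpu_model = torch.nn.Linear(1024, 1024)
        with torch.no_grad():
            for p_gpu, p_cpu in zip(gpu_model.parameters(), cpu_model.parameters()):
                p_gpu.copy_(p_cpu)

    # Every rank issues the same broadcasts: rank 0 sends, the rest receive.
    for p in gpu_model.parameters():
        dist.broadcast(p.data, src=0)

    # Wrapping in DDP would also re-broadcast parameters/buffers from rank 0.
    ddp_model = DDP(gpu_model, device_ids=[rank])
    # ... training with ddp_model would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)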

Thank you, it worked!
