PyTorch Distributed is running out of CPU RAM

I have two separate models in my algorithm: a large model that resides on the CPU and a small model that goes to the GPU. I am using DDP to train the small model on multiple GPUs while the large model stays on the CPU. I have 4 GPUs and I observe that the CPU model is also loaded 4 times, which causes an OOM on the CPU. Is there any way to keep a single CPU model alongside the multiple GPU models in DDP?

Do you mean you want to share the same CPU model across 4 DDP processes? If so, you can use torch.multiprocessing.Queue to share the model. Does the CPU model participate in training or just inference?
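
As a related sketch (not the Queue approach, and not code from this thread): one common way to keep a single CPU copy is to move the model's parameters into shared memory with nn.Module.share_memory() and hand the same module to every worker process. The model and sizes below are placeholders.

import torch
import torch.multiprocessing as mp

def worker(rank, cpu_model):
    # Every worker sees the same underlying parameter storage, not a copy.
    with torch.no_grad():
        out = cpu_model(torch.randn(2, 1024))
    print(f"rank {rank}: output sum = {out.sum().item():.4f}")

if __name__ == "__main__":
    cpu_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    cpu_model.share_memory()  # move parameters and buffers into shared memory

    procs = [mp.Process(target=worker, args=(rank, cpu_model)) for rank in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()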

Sorry for the late reply, I missed your comment. Yes, I want to use DDP but keep a single copy of the model on the CPU. My model runs its forward and backward passes on the GPU and the optimizer step on the CPU.

Yes, I want to use DDP but keep a single copy of the model on the CPU.

In this case, you might need to use inter-process communication to share the tensors.

Sorry that I still don’t fully understand the use case. Some pseudo code would be helpful.

Thank you. I am trying to load the CPU model in one thread and broadcast it to the GPUs. However, I have an issue with the barrier. Here is my code:

if args.local_rank in [-1, 0]:
    for net1, net2 in zip(self.encoder.layer[0].named_parameters(),
                          model_cpu.bert.encoder.layer[i].named_parameters()):
        net1[1].data.copy_(net2[1].data.clone(), non_blocking=True)

torch.distributed.barrier()

if args.local_rank not in [-1,0]:
    for name, p in self.encoder.layer[0].named_parameters():
        torch.distributed.broadcast(p, src=0)

My code gets stuck at the barrier. Any idea what could be wrong with it? Should I even be using a barrier?

The default process group is a per-process object, not per-thread. Is “thread” just a typo, and do you actually mean “process”?

if args.local_rank not in [-1,0]:
    for name, p in self.encoder.layer[0].named_parameters():
        torch.distributed.broadcast(p, src=0)

Collective communications require all ranks to make the same number of c10d API calls in the same order. It seems that, with the above code, rank 0 is not participating in the broadcast. If you need to broadcast within a subgroup, you will need to first create the subgroup with the new_group API and then call broadcast on that group.
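
For concreteness, here is a minimal sketch of that fix, reusing the names from your snippet (and using layer 0 on both sides just for illustration):

# Rank 0 copies the CPU weights into its GPU copy of the layer.
if args.local_rank in [-1, 0]:
    with torch.no_grad():
        for (_, p_gpu), (_, p_cpu) in zip(
                self.encoder.layer[0].named_parameters(),
                model_cpu.bert.encoder.layer[0].named_parameters()):
            p_gpu.copy_(p_cpu)

# Every rank in the process group issues the same broadcast calls in the
# same order: rank 0 sends its freshly copied values, the others receive.
if args.local_rank != -1:
    for _, p in self.encoder.layer[0].named_parameters():
        torch.distributed.broadcast(p.data, src=0)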

Sorry, yes, I meant process. I was thinking of loading the model on the CPU in one process, transferring it to GPU 0 (which is in the same process), and then copying the weights from GPU 0 to the other GPUs in the other processes. Is broadcasting the correct way to do this?

I was thinking of loading the model on the CPU in one process, transferring it to GPU 0 (which is in the same process), and then copying the weights from GPU 0 to the other GPUs in the other processes. Is broadcasting the correct way to do this?

Yes, this looks correct to me. DistributedDataParallel actually does something similar in its constructor: it broadcasts the module's parameters and buffers from rank 0 to all other ranks.
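
For anyone landing here later, a minimal self-contained sketch of this flow, not the exact code from this thread (the models are placeholders, and it assumes one GPU per process with the NCCL backend): rank 0 copies the CPU weights into its GPU model, every rank broadcasts, then the GPU model is wrapped in DDP.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Small per-process GPU model (placeholder).
    gpu_model = torch.nn.Linear(1024, 1024).cuda(rank)

    if rank == 0:
        # Large CPU model (placeholder) lives only in the rank 0 process.
        cpu_model = torch.nn.Linear(1024, 1024)
        with torch.no_grad():
            for p_gpu, p_cpu in zip(gpu_model.parameters(), cpu_model.parameters()):
                p_gpu.copy_(p_cpu)

    # Every rank issues the same broadcasts: rank 0 sends, the rest receive.
    for p in gpu_model.parameters():
        dist.broadcast(p.data, src=0)

    # Wrapping in DDP would also re-broadcast parameters/buffers from rank 0.
    ddp_model = DDP(gpu_model, device_ids=[rank])
    # ... training with ddp_model would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)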

Thank you, it worked!
