Memory doubles when initializing DDP

I found that when I build a model like

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # build a stack of 20 Linear layers on this process's GPU
    model = nn.Sequential(*[nn.Linear(2000, 2000).to(rank) for _ in range(20)])
    torch.cuda.synchronize()
    print_peak_memory("Max memory allocated after creating local model", rank)

    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    print_peak_memory("Max memory allocated after creating DDP", rank)

the peak memory allocated doubles. This is hard to accept when training a large-scale model. Are there any solutions to help with this?

PS: the code is from ZeRO
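
For reference, `print_peak_memory` is a small helper along these lines (a sketch; the tutorial's exact definition may differ):

    import torch

    # Sketch of the helper used above; reports peak allocated memory from rank 0 only.
    def print_peak_memory(prefix, device):
        if device == 0:
            print(f"{prefix}: {torch.cuda.max_memory_allocated(device) // 1e6} MB")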

Hello! Yes, the memory required in this example will double when using DDP. This is because the world_size is 2 and DDP is designed for data parallelism (same model, multiple data): the model is replicated across multiple machines or GPUs so that each copy processes a different shard of the data in parallel.
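
For context, a minimal world_size=2 setup of the kind this question implies might look like the sketch below; the nccl backend and the MASTER_ADDR/MASTER_PORT values are assumptions, and each spawned rank builds its own full replica of the model:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def worker(rank, world_size):
        # assumed rendezvous settings for a single-machine, two-GPU run
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

        # every rank holds a full copy of the model (data parallelism)
        model = nn.Sequential(*[nn.Linear(2000, 2000).to(rank) for _ in range(20)])
        ddp_model = DDP(model, device_ids=[rank])

        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)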

In the case of a large-scale model that does not fit on a single machine or GPU, you should look into Distributed RPC. RPC enables model parallelism (split model, same data): you partition the model across machines and use RPC calls to pass activations between the different pieces. The RPC framework also provides distributed autograd and a distributed optimizer, so the backward pass and optimizer steps are handled across workers for you.
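
To illustrate the idea (this is not the RPC API itself), here is a single-process sketch that splits the same 20-layer stack across two GPUs, so each device only holds half of the parameters; a Distributed RPC setup applies the same partitioning across workers, with remote calls carrying the activations between them:

    import torch
    import torch.nn as nn

    # Single-process sketch of model parallelism: the first half of the layers
    # lives on cuda:0 and the second half on cuda:1, so neither GPU holds the
    # whole model. Layer sizes match the example above; the split is arbitrary.
    class SplitModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(*[nn.Linear(2000, 2000) for _ in range(10)]).to("cuda:0")
            self.part2 = nn.Sequential(*[nn.Linear(2000, 2000) for _ in range(10)]).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            return self.part2(x.to("cuda:1"))

    model = SplitModel()
    out = model(torch.randn(32, 2000))  # activations move between GPUs inside forward()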