I found that when I build a model like this:
model = nn.Sequential(*[nn.Linear(2000, 2000).to(rank) for _ in range(20)])
print_peak_memory("Max memory allocated after creating local model", rank)
# construct DDP model
ddp_model = DDP(model, device_ids=[rank])
print_peak_memory("Max memory allocated after creating DDP", rank)
the peak memory roughly doubles after wrapping the model in DDP. That is hard to accept when training a large-scale model. Are there any solutions that would help with this?
PS: the code is from the ZeRO tutorial
Hello! Yes, peak memory in this example is expected to roughly double when using DDP. DDP is designed for data parallelism (same model, multiple data): every process holds a full replica of the model so that each GPU can process its own shard of the data. Note that the doubling you see per GPU is not caused by world_size being 2; during construction, DDP's reducer also allocates flattened bucket buffers (roughly the size of the model's gradients) used for gradient all-reduce, which accounts for the jump in the second print_peak_memory call.
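To make the wrapping pattern concrete, here is a minimal sketch that runs in a single process on CPU with the gloo backend and world_size=1; the master address/port values and the tiny model are illustrative placeholders, and a real data-parallel run would launch one process per GPU (e.g. via torchrun) and pass device_ids=[rank]:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "group" just to demonstrate the API; real data
# parallelism launches one process per GPU, each with its own rank.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29501")  # arbitrary free port
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
ddp_model = DDP(model)  # on GPU: DDP(model, device_ids=[rank])

out = ddp_model(torch.randn(4, 8))
out.sum().backward()  # gradients are all-reduced across the group here

dist.destroy_process_group()
```

Each process ends up with identical gradients after backward, which is what keeps the replicas in sync.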
In the case of a large-scale model that does not fit on a single machine or GPU, you should look into the Distributed RPC framework. RPC enables model parallelism (split model, same data): you can place different parts of the model on different machines and use RPC to pass activations between them. The framework also provides distributed autograd and a distributed optimizer to handle the backward pass and optimizer steps across workers.
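A minimal sketch of the RPC calling convention, runnable in a single process (the worker name, port, and the use of torch.add as the "remote" function are illustrative assumptions; a real model-parallel setup would run one process per machine, each owning some layers):

```python
import os
import torch
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29502")  # arbitrary free port

# One worker just to show the API; normally each worker is a separate
# process holding its own slice of the model.
rpc.init_rpc("worker0", rank=0, world_size=1)

# Synchronously run a function on a (here, the same) worker.
result = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), 3))
print(result)  # tensor([4., 4.])

rpc.shutdown()
```

In a real split, the forward pass would chain such calls between workers, and distributed autograd would stitch the backward pass together across them.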