Wrapping with DDP increases GPU memory

yiftach · September 19, 2023, 6:54pm

Hi,
This is probably trivial, but I saw in this example for ZeRO optimizer that after wrapping a model with DDP, the peak GPU memory almost doubles.
What is the reason? I thought DDP’s only job was synchronizing gradients efficiently during the backward pass, and I'm not sure what it needs to do during initialization. I also suspect it does something during forward() that I’m not aware of. Is that the case?

Thanks,
Yiftach

smth · September 19, 2023, 7:12pm

As mentioned on that page, each worker has a separate optimizer state, so that overhead adds to the extra memory usage.

The idea of ZeroRedundancyOptimizer comes from DeepSpeed/ZeRO project and Marian that shard optimizer states across distributed data-parallel processes to reduce per-process memory footprint.
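To put rough numbers on the optimizer-state overhead described above, here is a back-of-the-envelope sketch. It assumes fp32 values, Adam's two per-parameter buffers (exp_avg and exp_avg_sq), and a perfectly even split across ranks, which is an idealization of how the states are actually partitioned. The helper names are hypothetical and not part of PyTorch.

```python
def adam_state_bytes(num_params: int, bytes_per_elem: int = 4) -> int:
    """Approximate Adam state size: two buffers (exp_avg, exp_avg_sq) per parameter."""
    return 2 * num_params * bytes_per_elem


def per_rank_state_bytes(num_params: int, world_size: int, sharded: bool,
                         bytes_per_elem: int = 4) -> int:
    """Estimate optimizer-state bytes held by one rank.

    Without sharding every rank keeps the full state; with sharding we assume
    (as a simplification) that the state is divided evenly across ranks.
    """
    total = adam_state_bytes(num_params, bytes_per_elem)
    if not sharded:
        return total
    return -(-total // world_size)  # ceiling division


if __name__ == "__main__":
    n = 100_000_000  # hypothetical model with 100M fp32 parameters
    for sharded in (False, True):
        mb = per_rank_state_bytes(n, world_size=2, sharded=sharded) / 1e6
        print(f"sharded={sharded}: ~{mb:.0f} MB of Adam state per rank")
```

This only estimates optimizer state. It says nothing about memory allocated by the DDP wrapper itself.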

yiftach · September 19, 2023, 7:27pm

Thanks @smth, but as far as I can tell, the print that shows more memory is being allocated happens before any optimizer is instantiated:

...
    ddp_model = DDP(model, device_ids=[rank])
    print_peak_memory("Max memory allocated after creating DDP", rank)

    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    if use_zero:
        optimizer = ZeroRedundancyOptimizer(
            ddp_model.parameters(),
            optimizer_class=torch.optim.Adam,
            lr=0.01
        )
    else:
        optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.01)
...