Hi,
This is probably trivial, but I saw in this example for the ZeRO optimizer that after wrapping a model with DDP, the peak GPU memory almost doubles.
What is the reason? I thought DDP’s only job was to synchronize gradients efficiently during the backward pass, and I’m not sure what it needs to do during initialization. I also suspect it does something during forward() that I’m not aware of. Is that the case?
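For reference, the jump can be reproduced with something like the sketch below. It is a minimal sketch, not the exact code from the example. The helper names `peak_mb` and `measure_ddp_wrap`, the layer sizes, and the single-process group (world size 1, so no launcher is needed) are my own assumptions. It only reads the peak allocated memory before and after the DDP wrap.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def peak_mb(device):
    """Peak allocated GPU memory in MB; returns 0.0 when running on CPU."""
    if device.type != "cuda":
        return 0.0
    return torch.cuda.max_memory_allocated(device) / 2**20


def measure_ddp_wrap(width=2000, depth=20):
    """Return (peak before wrap, peak after wrap) in MB for a stack of Linear layers.

    The width/depth defaults are illustrative assumptions, not measured settings.
    """
    # Single-process group so the sketch runs without torchrun or mp.spawn.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda:0" if use_cuda else "cpu")
    dist.init_process_group("nccl" if use_cuda else "gloo", rank=0, world_size=1)
    try:
        layers = [nn.Linear(width, width) for _ in range(depth)]
        model = nn.Sequential(*layers).to(device)
        before = peak_mb(device)  # peak with only the parameters allocated
        ddp_model = DDP(model, device_ids=[0] if use_cuda else None)
        after = peak_mb(device)  # peak right after the DDP constructor returns
        del ddp_model
        return before, after
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    before, after = measure_ddp_wrap()
    print(f"peak before DDP: {before:.1f} MB, after DDP: {after:.1f} MB")
```

On a machine without a GPU both numbers are reported as 0.0, so the comparison is only meaningful on CUDA.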