Memory consumption for the model get doubled after wrapped with DDP

ptrblck · September 3, 2021, 3:58am

That’s great debugging!
I’ve checked the behavior with @mcarilli and he confirms that the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2 x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant.