That’s great debugging!
I’ve checked the behavior with @mcarilli and he confirms that the Reducer will create gradient buckets for each parameter, so the memory usage after wrapping the model into DDP will be `2 x model_parameter_size`. Note that the parameter size of a model is often much smaller than the activation size, so this memory increase might or might not be significant.
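To get a feel for the numbers, here is a rough back-of-the-envelope sketch in plain Python. The helper name `estimate_ddp_param_memory`, the 25M parameter count, and the 4 bytes per parameter (fp32) are all just assumptions for illustration, not anything from the DDP API. It also assumes the gradient buckets take exactly as much memory as the parameters, and it ignores activations, optimizer state, and allocator overhead.

```python
def estimate_ddp_param_memory(num_params, bytes_per_param=4):
    """Rough estimate following the 2 x model_parameter_size rule; not a measurement."""
    # Memory held by the model parameters themselves.
    param_bytes = num_params * bytes_per_param
    # Assumption: the Reducer's gradient buckets match the parameters in size.
    bucket_bytes = param_bytes
    return {
        "parameters": param_bytes,
        "gradient_buckets": bucket_bytes,
        "total": param_bytes + bucket_bytes,
    }


# Hypothetical model with 25 million fp32 parameters.
estimate = estimate_ddp_param_memory(25_000_000)
for name, size in estimate.items():
    print(f"{name}: {size / 2**20:.1f} MiB")
```

If your activations take several GB per batch, an extra copy of the parameters at this scale will probably not be the thing that pushes you out of memory, but for very large models with small batches it could matter.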