Memory consumption for the model doubles after wrapping with DDP

I’ve debugged my script line by line and found that the allocated memory doubles when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel.
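
Here is a minimal sketch of how I measured it (single process, one GPU; the model and the process-group settings are just stand-ins for my real script):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in setup: single-process "world" so DDP can be constructed on one GPU.
dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                        rank=0, world_size=1)

# Stand-in model; my real network is larger.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

before = torch.cuda.memory_allocated()
ddp_model = DDP(model, device_ids=[0])  # the Reducer is built inside this call
after = torch.cuda.memory_allocated()

print(f"allocated before DDP: {before / 2**20:.1f} MiB")
print(f"allocated after  DDP: {after / 2**20:.1f} MiB")
```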

I understand the Reducer is a necessary component of DDP, because it sums up the gradients from all the devices.
But I don’t know how the Reducer works, so I still can’t understand why the memory doubles.
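
To put a number on “doubled”, I compared the jump in allocated memory with the total size of the model’s parameters (a sketch only; `param_mib` is just a helper I wrote for this check):

```python
import torch

def param_mib(model: torch.nn.Module) -> float:
    """Total parameter storage in MiB, to compare against the jump
    seen when DDP instantiates its Reducer."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20

# e.g. param_mib(model) for the stand-in model above; in my runs the
# increase after wrapping with DDP is roughly this size.
```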

  1. Is it expected behavior that the Reducer takes as much additional memory as the local model itself?
  2. Does the Reducer take the additional memory only on the rank 0 device?
    That is, would the additional memory consumption not occur on rank 1 or rank 2?
    I can’t check this because I only have one GPU, but I would check it roughly as in the sketch after this list.
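
If I had more than one GPU, I would check question 2 with something like this (a sketch only; the two-GPU world size is hypothetical, since I can’t actually run it):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def check_rank(rank: int, world_size: int):
    # One process per GPU; report allocated memory before/after DDP for each rank.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "23456"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda(rank)  # stand-in model
    before = torch.cuda.memory_allocated(rank)
    ddp_model = DDP(model, device_ids=[rank])
    after = torch.cuda.memory_allocated(rank)
    print(f"rank {rank}: {before / 2**20:.1f} MiB -> {after / 2**20:.1f} MiB")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # hypothetical two-GPU machine
    mp.spawn(check_rank, args=(world_size,), nprocs=world_size)
```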