I’ve debugged my script line by line and found that the allocated memory doubles when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel.
I understand the reducer is a necessary component of DDP, since it sums up the gradients from all the devices. But I don’t know how the reducer works internally, so I still can’t understand why the memory doubles.
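For reference, this is a minimal sketch of how I measured it (using a plain nn.Linear as a stand-in for my actual model, and a single-process group just so DDP can be constructed on one GPU):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def report(tag):
    # Report the memory currently allocated by tensors on the default GPU.
    torch.cuda.synchronize()
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated")

# Single-process group, only so that DDP can be constructed on one GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = nn.Linear(8192, 8192).cuda()    # ~256 MiB of fp32 parameters
report("after moving model to GPU")

ddp_model = DDP(model, device_ids=[0])  # the Reducer is created inside this constructor
report("after wrapping in DDP")

dist.destroy_process_group()
```

My guess is that the extra memory comes from the gradient buckets the reducer allocates, which together would be roughly the same size as the parameters, but I’m not sure if that’s what is actually happening.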
- Is it expected behavior that the reducer takes as much additional memory as the local model itself?
- Does the reducer take the additional memory only on the rank:0 device? In other words, would the additional memory consumption not occur on rank:1 or rank:2?
I can’t check this myself because I have only one GPU, but the sketch below shows roughly what I would run on a multi-GPU machine.
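This is only a sketch of how I would compare the per-rank allocation, again with nn.Linear standing in for my model:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(8192, 8192).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Print the allocated memory on each rank's GPU after DDP construction.
    torch.cuda.synchronize()
    print(f"rank {rank}: {torch.cuda.memory_allocated(rank) / 2**20:.1f} MiB allocated")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If someone could run something like this and tell me whether the extra allocation shows up on every rank or only on rank:0, that would answer my second question.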