Question about model behavior with DistributedDataParallel

Hi,

I have a short question about how the model's initial weights behave with DDP.

Let’s say I use two GPUs, and my code looks something like this:

model = Model().to(device)
model = DistributedDataParallel(model)

My question is whether the model is initialized twice, separately on each rank, or whether the initialization from rank 0 is mirrored on rank 1.

So do both GPUs start from the same initialization, or from different ones, assuming the seeds differ across processes?
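For context, here is a minimal sketch of the per-process setup I mean (the Model class and the torchrun launch are just placeholders, not my exact code):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")      # one process per GPU, e.g. launched with torchrun --nproc_per_node=2
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")        # device differs per process: cuda:0 on rank 0, cuda:1 on rank 1
torch.manual_seed(rank)                      # deliberately different seeds, so each process builds different weights
model = Model().to(device)                   # each process constructs and initializes its own copy
model = DistributedDataParallel(model, device_ids=[rank])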

Per my understanding, the model is already initialized upon your call to Model(). Then .to(device) will move it to the specified device (which will probably be the first GPU on the first machine).

The call DistributedDataParallel(model) will replicate your model’s params on the machines and devices you’ve specified.

So it starts from the same initialization. Moreover, the gradients computed for your model's params on each machine/device will be sent back to the first machine/device and summed.

But .to(device) is called once per process, i.e. two times in total, where device is the GPU corresponding to that rank. So at first the model could end up with two different initializations on the two GPUs; what I am wondering is whether DistributedDataParallel(model) syncs the initialization from rank 0 to rank 1.

Not sure where .to(device) is getting called twice. I believe we are supposed to call it only once, for just one device.

I have used DataParallel and not DistributedDataParallel, but I believe the semantics are the same. In my usage, I call model.to(device) only once (where device is cuda:0). After that, I do dpmodel = DataParallel(model) only once. This replicates the model onto both devices.

The input is also on the original device (cuda:0). Now, if I perform output = dpmodel(input), half of the input batch is sent to cuda:1. The computation runs in parallel on both GPUs, and output is formed by gathering (concatenating) the two half-batch outputs, so output ends up on the same device as input.
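Roughly, as a sketch (the Linear model and the sizes are just for illustration, assuming two visible GPUs):

import torch
from torch.nn import DataParallel, Linear

device = torch.device("cuda:0")
model = Linear(10, 2).to(device)             # the original model lives on cuda:0
dpmodel = DataParallel(model)                # replicates the module onto all visible GPUs at forward time

input = torch.randn(8, 10, device=device)    # full batch on cuda:0
output = dpmodel(input)                      # each GPU processes half the batch; results are gathered back
print(output.device)                         # cuda:0, same device as the input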

Perhaps my replies are not applicable. DistributedDataParallel seems much more complicated :-). Please ignore.

Yeah, DP behaves very differently from DDP, so your reply does not really answer my question, but thanks for the effort.

Hope someone who knows the details of DDP can help me.

From the Distributed Data Parallel — PyTorch 2.1 documentation, under Internal Design:

Construction: The DDP constructor takes a reference to the local module, and broadcasts state_dict() from the process with rank 0 to all other processes in the group to make sure that all model replicas start from the exact same state.
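So the initialization from rank 0 is what every rank ends up with, regardless of the seeds. As a quick sanity check, here is a sketch (assuming two GPUs, a torchrun launch, and a small Linear model just for illustration) that seeds each rank differently and compares a parameter before and after wrapping:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.manual_seed(rank)                      # different seed per rank -> different initial weights
device = torch.device(f"cuda:{rank}")

model = torch.nn.Linear(4, 4).to(device)
before = model.weight.detach().clone()       # weights as initialized locally on this rank

ddp = DistributedDataParallel(model, device_ids=[rank])
after = ddp.module.weight.detach()           # weights after the constructor's broadcast from rank 0

print(rank, torch.equal(before, after))      # expected: True on rank 0, False on rank 1

Launched with something like torchrun --nproc_per_node=2 check_ddp_init.py, rank 0 should keep its weights while rank 1's weights are overwritten to match rank 0.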

Hope this helps.
