Question about model behavior with DistributedDataParallel

Hi,

I have a short question about how the model's initial weights behave with DDP.

Let’s say I use two GPUs, and my code looks something like this:

model = Model().to(device)
model = DistributedDataParallel(model)

My question is whether the model is initialized twice, separately on each rank, or whether the initialization from rank 0 is mirrored on rank 1.

So do both GPUs start from the same initialization, or from different ones, assuming the seeds differ across processes?
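For context, here is a minimal sketch of the per-process setup I mean (the Model class and the torchrun launch are just placeholders, not my exact code):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")      # one process per GPU, e.g. launched with torchrun --nproc_per_node=2
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")        # device differs per process: cuda:0 on rank 0, cuda:1 on rank 1
torch.manual_seed(rank)                      # deliberately different seeds, so each process builds different weights
model = Model().to(device)                   # each process constructs and initializes its own copy
model = DistributedDataParallel(model, device_ids=[rank])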

Per my understanding, the model is already initialized upon your call to Model(). Then .to(device) will move it to the specified device (which will probably be the first GPU on the first machine).

The call DistributedDataParallel(model) will replicate your model’s params on the machines and devices you’ve specified.

So it starts from the same initialization. Moreover, the gradients computed for your model's params on each machine/device will be sent back to the first machine/device and summed.

But .to(device) is called once per process, i.e. two times in total, where device is the GPU corresponding to that rank. So at first the model could end up with two different initializations on the two GPUs; what I am wondering is whether DistributedDataParallel(model) syncs the initialization from rank 0 to rank 1.

Not sure where .to(device) is getting called twice. I believe we are supposed to call it only once, for just one device.

I have used DataParallel and not DistributedDataParallel, but I believe the semantics are the same. In my usage, I call model.to(device) only once (where device is cuda:0). After that, I do dpmodel = DataParallel(model) only once. This replicates the model onto both devices.

The input is also on the original device (cuda:0). Now, if I perform output = dpmodel(input), half of the input batch is sent to cuda:1. The computation runs in parallel on both GPUs, and output is formed by gathering (concatenating) the two half-batch outputs, so output ends up on the same device as input.
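Roughly, as a sketch (the Linear model and the sizes are just for illustration, assuming two visible GPUs):

import torch
from torch.nn import DataParallel, Linear

device = torch.device("cuda:0")
model = Linear(10, 2).to(device)             # the original model lives on cuda:0
dpmodel = DataParallel(model)                # replicates the module onto all visible GPUs at forward time

input = torch.randn(8, 10, device=device)    # full batch on cuda:0
output = dpmodel(input)                      # each GPU processes half the batch; results are gathered back
print(output.device)                         # cuda:0, same device as the input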

Perhaps my replies are not applicable. DistributedDataParallel seems much more complicated :-). Please ignore.

Yeah, DP behaves very differently from DDP, so your reply does not really answer my question, but thanks for the effort.

Hope someone who knows the details of DDP can help me.

From the Distributed Data Parallel — PyTorch 2.1 documentation, under Internal Design:

Construction: The DDP constructor takes a reference to the local module, and broadcasts state_dict() from the process with rank 0 to all other processes in the group to make sure that all model replicas start from the exact same state.
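So the initialization from rank 0 is what every rank ends up with, regardless of the seeds. As a quick sanity check, here is a sketch (assuming two GPUs, a torchrun launch, and a small Linear model just for illustration) that seeds each rank differently and compares a parameter before and after wrapping:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.manual_seed(rank)                      # different seed per rank -> different initial weights
device = torch.device(f"cuda:{rank}")

model = torch.nn.Linear(4, 4).to(device)
before = model.weight.detach().clone()       # weights as initialized locally on this rank

ddp = DistributedDataParallel(model, device_ids=[rank])
after = ddp.module.weight.detach()           # weights after the constructor's broadcast from rank 0

print(rank, torch.equal(before, after))      # expected: True on rank 0, False on rank 1

Launched with something like torchrun --nproc_per_node=2 check_ddp_init.py, rank 0 should keep its weights while rank 1's weights are overwritten to match rank 0.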

Hope this helps.
