Simultaneously change parameters on all the GPUs

yangzhh · August 1, 2019, 9:56am

While training a network on 8 GPUs in parallel, I want to manually change the parameters of the network with the following code:

for param in model.parameters():
    param.data.fill_(new_value)  # new_value: placeholder for the value to write

Would this change the parameters on all the GPUs, or only on GPU:0?

pietern · August 1, 2019, 11:40am

It depends on what you’re using for parallelism.

If you use nn.DataParallel you should be able to do this, as the model is replicated to the other GPUs in every iteration. This means you only need to modify the parameters of the root module. This is also where you’d run the optimizer, for example.
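As a minimal sketch of the above, assuming a small stand-in network and a hypothetical constant `new_value` (neither is from the original post), the root module can be reached through `parallel_model.module` and edited in place. The next forward pass then runs on replicas built from the edited values. On a machine without GPUs, `nn.DataParallel` simply calls the wrapped module, so the snippet still runs.

```python
import torch
import torch.nn as nn

# Stand-in network; the real model from the question is not shown in the thread.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).to(device)

# Wraps the root module; uses all visible GPUs, or calls the module directly on CPU.
parallel_model = nn.DataParallel(model)

new_value = 0.5  # hypothetical value, for illustration only

# Edit the root module's parameters in place, without tracking the edit in autograd.
with torch.no_grad():
    for param in parallel_model.module.parameters():
        param.fill_(new_value)

# The forward pass replicates the (already edited) root module to each GPU.
out = parallel_model(torch.ones(16, 4, device=device))
```

Because every weight and bias is 0.5 and the input is all ones, every output element should be identical regardless of which GPU produced it, which is an easy way to check that no replica ran with stale values.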

yangzhh · August 3, 2019, 6:11pm

Does "the model is replicated to the other GPUs in every iteration" mean that the state_dict is copied to the other GPUs every iteration? So the mean and var in BN are also copied from GPU:0 to the other GPUs?

Is there any documentation that explains this process in detail? I am really curious about the parallelism mechanism used in PyTorch, since I always run experiments in a multi-GPU environment.

pietern · August 5, 2019, 5:14am

Yes, that’s correct. The documentation covers this (the replication bit), see torch.nn.DataParallel. Note that this is not how the distributed version works. There, every process runs forward/backward/optimizer against its own single copy of the model, and the parameters stay equivalent not by replicating the values, but by executing the exact same optimizer step in every process.
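The idea behind the distributed version can be illustrated without starting any processes. The sketch below is a single-process simulation, not the actual distributed API: two copies of a toy layer stand in for two processes, and a manual gradient average stands in for the gradient synchronization that happens during backward. All names are illustrative.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two "processes", each holding its own copy with identical starting parameters.
replica_a = nn.Linear(4, 2)
replica_b = copy.deepcopy(replica_a)
opt_a = torch.optim.SGD(replica_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(replica_b.parameters(), lr=0.1)

# Each copy sees a different shard of the data, so local gradients differ.
x_a, x_b = torch.randn(8, 4), torch.randn(8, 4)
replica_a(x_a).sum().backward()
replica_b(x_b).sum().backward()

# Stand-in for gradient synchronization: both copies end up with the mean gradient.
for p_a, p_b in zip(replica_a.parameters(), replica_b.parameters()):
    mean_grad = (p_a.grad + p_b.grad) / 2
    p_a.grad.copy_(mean_grad)
    p_b.grad.copy_(mean_grad)

# Same parameters + same gradients + same optimizer step = same result.
opt_a.step()
opt_b.step()

params_match = all(
    torch.equal(p_a, p_b)
    for p_a, p_b in zip(replica_a.parameters(), replica_b.parameters())
)
```

Under this scheme nothing copies parameter values from one process to another after the start, so a manual parameter change would presumably have to be applied identically in every process to keep the copies in agreement.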

yangzhh · August 5, 2019, 7:15am

Thanks a lot! I see what you mean. So, to summarize, the multi-GPU environment works as follows:

1. Scatter the model and the state dict from GPU:0 to all the GPUs.
2. Split the data, and separately forward it on the different GPUs.
3. Gather the outputs from the GPUs to GPU:0.
4. Calculate the loss using the outputs and targets on GPU:0.
5. Backpropagate the loss to the GPUs and separately calculate the gradients.
6. Gather the gradients from the GPUs to GPU:0.
7. Update the parameters on GPU:0.
8. Go to step 1.

So the only thing that is not fully synchronized is the mean and var of BN, because they are not gathered to GPU:0 during backward. All the other parameters are fully synchronized because of the gather-scatter mechanism.
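The loop summarized above can be approximated with the building blocks in `torch.nn.parallel` (`replicate`, `scatter`, `parallel_apply`, `gather`). This is a rough sketch rather than the actual `nn.DataParallel` implementation: the helper name `data_parallel_forward`, the toy model, and the random data are assumptions made for illustration. As far as the backward pass is concerned, gradients from the replicas are accumulated onto the original module's parameters by autograd, so the optimizer only needs to see the root module.

```python
import torch
import torch.nn as nn

def data_parallel_forward(module, inputs, device_ids):
    """Illustrative forward pass in the style of nn.DataParallel."""
    if not device_ids:
        # No GPUs visible: nothing to replicate, run the module directly.
        return module(inputs)
    replicas = nn.parallel.replicate(module, device_ids)    # copy model to each GPU
    chunks = nn.parallel.scatter(inputs, device_ids)        # split batch across GPUs
    replicas = replicas[:len(chunks)]                       # small batches may use fewer GPUs
    outputs = nn.parallel.parallel_apply(replicas, chunks)  # forward on each GPU
    return nn.parallel.gather(outputs, device_ids[0])       # collect outputs on the first GPU

device_ids = list(range(torch.cuda.device_count()))
device = torch.device(f"cuda:{device_ids[0]}" if device_ids else "cpu")

model = nn.Linear(4, 2).to(device)  # toy model living on the root device
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(16, 4, device=device)
targets = torch.randn(16, 2, device=device)

outputs = data_parallel_forward(model, inputs, device_ids)
loss = nn.functional.mse_loss(outputs, targets)  # loss computed on the root device
optimizer.zero_grad()
loss.backward()   # gradients flow back through the replicas to the root parameters
optimizer.step()  # update happens only on the root module
```

Since the replicas are rebuilt from the root module on every call, the updated parameters are what the next iteration starts from.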