Confusion about multi-gpu

I want to use PyTorch for multi-GPU parallel computation. I see that the official documentation suggests the following:

device = torch.device("cuda:0")
model = nn.DataParallel(model)
model.to(device)

I am a bit confused: if we are sending the model to “cuda:0”, how are the other GPUs going to be utilized?

The model is sent to cuda:0. This means the master copy of the model parameters is stored on that GPU.
DataParallel copies the model to every available GPU and splits each batch equally among them. It then computes the forward/backward pass on each GPU independently and sums the gradients from all replicas into the model on cuda:0, where the optimizer updates the parameters.
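A minimal sketch of that flow, assuming a hypothetical toy `nn.Linear` model and random input (neither comes from the original question). It only wraps the model when more than one GPU is visible, so it should also run on a single GPU or on CPU:

```python
import torch
import torch.nn as nn

# Hypothetical toy model, used only for illustration.
model = nn.Linear(10, 2)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # Uses all visible GPUs by default; replicas are created on each forward pass.
    model = nn.DataParallel(model)
model.to(device)  # the master copy of the parameters lives on this device

batch = torch.randn(8, 10).to(device)
output = model(batch)  # with DataParallel, the batch is split along dim 0
loss = output.sum()
loss.backward()        # replica gradients are summed into the master copy

# The wrapped model is reachable through .module when DataParallel is used.
inner = model.module if isinstance(model, nn.DataParallel) else model
print(output.shape)             # torch.Size([8, 2])
print(inner.weight.grad.shape)  # torch.Size([2, 10])
```

The output always comes back as one tensor covering the whole batch, which is why the training loop looks the same with or without the wrapper.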

On each GPU everything is independent until the results are moved back to one GPU. This means that batch normalization, or anything else that depends on batch statistics, is computed per GPU on its own chunk of the mini-batch. I have seen some third-party implementations that share data across GPUs to compute a global batch normalization, but I have never tried them.
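To see why this matters, here is a small plain-Python sketch with made-up activation values (an assumption for illustration, not real data). It compares the mean and biased variance each of two GPUs would compute on its own half of the batch with the statistics of the full batch:

```python
from statistics import fmean, pvariance

# Hypothetical activations of one feature over a batch of 8 samples.
batch = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

# With two GPUs, each replica only sees its own half of the batch.
chunks = [batch[:4], batch[4:]]

# Batch norm normalizes with the mean and the biased (population) variance.
per_gpu_stats = [(fmean(c), pvariance(c)) for c in chunks]
global_stats = (fmean(batch), pvariance(batch))

print("per GPU:", per_gpu_stats)
print("global: ", global_stats)
```

Each chunk is normalized with its own statistics, so the result differs from normalizing the whole batch at once. The gap shrinks when the per-GPU chunks are large and well shuffled.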

The main GPU usually requires more memory since it stores the master model parameters and the optimizer's state. Therefore I would recommend using a different GPU to load data and to compute the loss.
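One way to do this is the `output_device` argument of `nn.DataParallel`, which chooses the GPU where the outputs are gathered. The sketch below is an assumption about how that could be set up, with a hypothetical toy model and random targets. It picks the last visible GPU for the loss and falls back to a single device when fewer than two GPUs are present:

```python
import torch
import torch.nn as nn

n_gpus = torch.cuda.device_count()
model = nn.Linear(10, 2)  # hypothetical toy model
criterion = nn.MSELoss()

if n_gpus > 1:
    # Parameters must sit on device_ids[0]; outputs are gathered on output_device.
    loss_device = torch.device(f"cuda:{n_gpus - 1}")
    model = nn.DataParallel(
        model.to("cuda:0"),
        device_ids=list(range(n_gpus)),
        output_device=n_gpus - 1,
    )
else:
    # Fallback: everything stays on one device.
    loss_device = torch.device("cuda:0" if n_gpus == 1 else "cpu")
    model = model.to(loss_device)

inputs = torch.randn(8, 10).to(loss_device)
targets = torch.randn(8, 2).to(loss_device)

outputs = model(inputs)             # gathered on loss_device
loss = criterion(outputs, targets)  # loss is computed away from cuda:0
loss.backward()
```

This keeps the gathered outputs, the targets, and the loss off cuda:0, although the master parameters and optimizer state still live there.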

@JuanFMontesinos Thanks a lot for your response!