Multi-GPU training pipeline in 0.4.1

I am trying to use torch.nn.DataParallel to train a model on multiple GPUs in 0.4.1, but I have found several different solutions on Google, which confuses me. The following is my basic configuration.

model = resnet50()
device_0 = torch.device('cuda:0')
device_ids = [0, 1, 2, 3]

I want to know:

  1. which order below is correct?

    • model.to(device_0)
      model = torch.nn.DataParallel(model, device_ids=device_ids)

      or

    • model = torch.nn.DataParallel(model, device_ids=device_ids)
      model.to(device_0)

  2. for the optimizer,

    • which parameters should I pass to it: those of the model before wrapping with torch.nn.DataParallel, or after?
    • is it necessary to wrap the optimizer itself with torch.nn.DataParallel?
  3. In the DataParallel source code, the outputs from the different GPUs are gathered onto the output device and the loss is computed there. Is the backward pass then run on the output device only, or on all devices, and how?


Okay, some explanations with extra info.

DataParallel replicates the model on the GPUs. When you compute the forward pass, the batch is split into per-GPU minibatches with a balanced number of samples, i.e. BS / n_gpus each.
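A minimal sketch of that splitting behaviour (ToyNet is just a toy module I made up for illustration, and I'm assuming 4 visible GPUs):

import torch
import torch.nn as nn

class ToyNet(nn.Module):
    # Tiny module that reports how large a chunk each replica receives.
    def __init__(self):
        super(ToyNet, self).__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        # Each replica only ever sees its own chunk of the batch.
        print('replica on', x.device, 'received', x.size(0), 'samples')
        return self.fc(x)

net = nn.DataParallel(ToyNet().cuda(), device_ids=[0, 1, 2, 3])
out = net(torch.randn(32, 10).cuda())   # prints a batch of 8 samples on each of the 4 GPUs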

The hidden detail is that there is a main GPU, which typically uses more memory.
If, for example, you use an optimizer with running estimators (e.g. Adam), those estimators are stored on that main GPU.

This answers your question 1: the proper way is the second one. Note that if you set the output_device argument, it must coincide with the main GPU, or you will get an error along the lines of "tensors not allocated on the same GPU".
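In code, reusing the names from your snippet (a minimal sketch; I'm assuming resnet50 is the torchvision one):

import torch
from torchvision.models import resnet50

device_0 = torch.device('cuda:0')
device_ids = [0, 1, 2, 3]

model = resnet50()
model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=0)  # output_device defaults to device_ids[0]
model = model.to(device_0)  # parameters end up on the main GPU; replicas are created inside each forward call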

From the documentation: DataParallel implements data parallelism at the module level. This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module. The batch size should be larger than the number of GPUs used.
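So for your question 3, the flow looks roughly like this (a sketch continuing from the snippet above; the loss tensor lives on the output device, backward runs through every replica, and the summed gradients land in model.module on the main GPU):

criterion = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 3, 224, 224).to(device_0)                      # scattered across the 4 GPUs during forward
targets = torch.randint(0, 1000, (32,), dtype=torch.long).to(device_0)

outputs = model(inputs)             # forward runs on every replica, results are gathered on cuda:0
loss = criterion(outputs, targets)  # the loss is computed on the output device
loss.backward()                     # backward goes through each replica; gradients are summed into model.module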

And for your last question: you don't have to wrap the optimizer in DataParallel, and you can create it before wrapping the model with DataParallel.
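For example (a sketch continuing from above; the wrapper shares its parameters with the underlying module, so model.parameters() and model.module.parameters() refer to the same tensors, and it makes no difference whether you build the optimizer before or after wrapping):

# No DataParallel wrapper for the optimizer; just hand it the parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam's running estimates live on the main GPU

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()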
