Data Parallelism Questions

Hi, I'm running a standard DCGAN model on 2 GPUs, and the results look a little different from the runs I did on a single GPU. I have a few questions about GPU parallelism.

  1. What should we do with the loss? Can we just move it to a device, e.g. MSELoss().cuda(), and expect it to work? Why would we or wouldn't we want to wrap the loss in DataParallel()?

  2. If we wrap the loss in DataParallel() with 2 GPUs, we get a loss scalar from each device. Can we just average them into a single scalar for backpropagation, or is there a better way to handle that?

  3. The tutorial at https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html shows the following usage of .to():

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # defined earlier in the tutorial

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

However, the documentation (https://pytorch.org/docs/stable/tensors.html#torch.Tensor.to) says: "Returns a Tensor with the specified device and (optional) dtype." So does .to() modify the model or tensor in place, or do we have to reassign it, like model = model.to(device)?

I have packed 3 questions into 1 post since they all seem closely related to me. Thank you for your attention.

UPDATE: Question 3 is answered by looking in the right place over here: https://pytorch.org/docs/stable/nn.html#torch.nn.Module.to. This method modifies the module in-place.

Regarding my experience with DataParallel: I stopped using it because, while it runs faster, I usually need more epochs for good convergence (assuming the same minibatch size per GPU). So I usually use multiple GPUs for hyperparameter tuning rather than splitting the dataset across them.

Regarding what to do with the loss: the loss is usually super cheap to compute, and since the gradients are gathered on the main GPU device anyway, I don't think it's worth the effort of computing it on parallel devices. However, it is possible and not much hassle. For an example, see Uneven GPU utilization during training backpropagation - #12 by neonrights.
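
To make that concrete, here is a minimal sketch of the usual pattern, assuming 2 GPUs; TinyNet and the shapes are just placeholders for your DCGAN networks. The wrapped model scatters the batch, gathers the outputs back on the main device, and the criterion produces a single scalar there, so the loss does not need DataParallel:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # stand-in for your DCGAN discriminator/generator (name is made up)
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(100, 1)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda:0")
model = nn.DataParallel(TinyNet()).to(device)
criterion = nn.MSELoss().to(device)   # plain .cuda()/.to(device) is enough, no DataParallel needed

inputs = torch.randn(64, 100, device=device)   # full batch lives on the main device
targets = torch.zeros(64, 1, device=device)

outputs = model(inputs)              # scattered across the GPUs, gathered back onto cuda:0
loss = criterion(outputs, targets)   # a single scalar on cuda:0
loss.backward()

# if you did wrap the criterion in DataParallel, you would get one loss per replica
# and would have to reduce them yourself before calling backward, e.g. loss.mean().backward()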

Regarding

So does .to() modify the model or tensor in place, or do we have to reassign it, like model = model.to(device)?

I think it works in-place, but doing

model = model.to(device) 

does not hurt. I almost always do that because I remember there were some places where this was inconsistent in PyTorch (like with the tensor example you mentioned). In my opinion, it should be named

model.to_(device) 

to be consistent with the rest of the API (and the tensor example you mentioned).
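
To illustrate the inconsistency, here is a minimal sketch (nothing in it is specific to DataParallel; the layer and shapes are arbitrary):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# nn.Module.to() moves the parameters in place (and also returns the module, so reassigning is harmless)
model = nn.Linear(10, 1)
model.to(device)
print(next(model.parameters()).device)   # already on `device` without reassignment

# torch.Tensor.to() is not in place: it returns a (possibly new) tensor and leaves the original alone
x = torch.randn(3)
y = x.to(device)
print(x.device, y.device)   # on a GPU machine: x stays on cpu, y lives on cuda:0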

Hope that helps!

Thanks a lot for your response!
I agree with you that it should be named model.to_(device). I'm also going to keep using model = model.to(device), since the overhead should be insignificant.

Any idea why convergence takes longer? This is exactly the behavior I noticed with my DCGAN model.

Not sure. Maybe because if you keep the overall batch size the same and use multiple GPUs, each replica only sees 1/num_gpus of the data points, so the update is noisier when the gradients from the different replicas are combined for the next round. And if you increase the batch size by num_gpus so that each GPU sees the same batch size as in 1-GPU mode, the averaging over the gradients may also not be ideal, because you lose too much signal from the effectively larger batch.
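
A quick way to see the first effect is to print the input size inside a wrapped module, like the tutorial you linked does. This is just a minimal sketch; EchoNet and the sizes are made up:

import torch
import torch.nn as nn

class EchoNet(nn.Module):
    # toy module (name made up) that just reports which shard of the batch it receives
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        print("replica sees input of size", x.size())
        return self.fc(x)

model = nn.DataParallel(EchoNet()).cuda()
_ = model(torch.randn(64, 10).cuda())
# with 2 GPUs this prints torch.Size([32, 10]) twice: each replica only sees half the batch,
# so the per-GPU batch is smaller (and the gradient estimate noisier) unless you scale the batch size up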

Maybe it's because I am a scientist, not an engineer, but I am generally not a big fan of parallel approaches for DL; I like the method to be agnostic to which device, and how many devices, are being used :slight_smile:
