PyTorch DataParallel usage

nn.DataParallel is easy to use when we just have neural network weights.

  1. What if we have an arbitrary (non-differentiable) preprocessing function in our module? nn.DataParallel does not seem to work well with arbitrary PyTorch tensor functions; at the very least, it doesn’t understand how to allocate the tensors dynamically to the right GPU.

For example, I have this normalization code as the preprocessing for my module:

```python
def normalize(self, v):
    # mean, std, and clip_range are defined elsewhere in the module
    return torch.clamp(
        (v.to(torch.cuda.current_device()) - mean.to(torch.cuda.current_device())) / std,
        -clip_range,
        clip_range,
    )
```

I get this error:
RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:1 and input b is on cuda:0

I get the above error with or without the to(torch.cuda.current_device()) calls.

  2. What if we spawn weights outside of the init function?
    E.g. torch.zeros(…) usually gives a similar “input a is on cuda:1 and input b is on cuda:0” error. I fixed these cases with to(torch.cuda.current_device()), but I’m not sure whether there’s a silent error affecting the training speed (see the sketch after this list).

  3. How do we actually check that the tensors are being properly distributed across the GPUs, and that we didn’t break the speed somewhere by accidentally moving from GPU to CPU?
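For context, here is a minimal, hypothetical sketch of the pattern behind both errors (the module name, constants, and shapes are made up; only the normalize idea comes from the snippet above). nn.DataParallel scatters the input across GPUs, but plain tensor attributes, and tensors created with a fixed device, stay where they were made, typically cuda:0:

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):  # hypothetical module, only to illustrate the failure mode
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 3)
        # Plain tensor attributes are neither parameters nor registered buffers,
        # so DataParallel's replication leaves them on the GPU they were created
        # on (cuda:0 here), no matter which GPU a replica runs on.
        self.mean = torch.tensor([0.485, 0.456, 0.406]).cuda()
        self.std = torch.tensor([0.229, 0.224, 0.225]).cuda()
        self.clip_range = 5.0

    def normalize(self, v):
        # v is the chunk DataParallel scattered to this replica's GPU; on the
        # replica running on cuda:1 it no longer matches self.mean on cuda:0.
        return torch.clamp((v - self.mean) / self.std, -self.clip_range, self.clip_range)

    def forward(self, v):
        # Question 2's pattern: a fresh tensor pinned to a fixed device
        # (a hypothetical stand-in for the original torch.zeros call).
        extra = torch.zeros(v.shape[0], 3, device="cuda:0")
        return self.linear(self.normalize(v)) + extra

model = nn.DataParallel(MyModel().cuda())
out = model(torch.randn(8, 3).cuda())  # on a multi-GPU machine, the cuda:1 replica raises the mismatch error
```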

I’d like to point you to this figure from Medium.

  1. Try mean.to(v.device) (see the sketch below).
  2. Use torch.zeros(…, device=v.device).
  3. You can get it from the two examples.
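Concretely, for the normalize snippet above, a sketch along those lines might look like this (assuming mean, std, and clip_range live on the module; self.mean etc. are my guesses at the original names):

```python
def normalize(self, v):
    # Move the statistics to wherever this replica's input chunk lives,
    # instead of relying on torch.cuda.current_device().
    mean = self.mean.to(v.device)
    std = self.std.to(v.device)
    return torch.clamp((v - mean) / std, -self.clip_range, self.clip_range)
```

If mean and std are fixed, another option is to register them with self.register_buffer("mean", ...) in __init__; DataParallel broadcasts registered buffers to each replica’s GPU, so the .to(v.device) calls become unnecessary.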

What do you mean by two examples?

Also, I think using torch.cuda.current_device() actually fixed the error, but I still don’t know whether the work is being distributed uniformly across all GPUs.

The two examples refer to mean.to(v.device) and torch.zeros(..., device=v.device). They are the solutions to your first two questions.
It’s not guaranteed to be distributed uniformly across all GPUs. This is how data_parallel distributes the input:

“Slices tensors into approximately equal chunks and distributes them across given GPUs. Duplicates references to objects that are not tensors.”
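One quick way to see what the scatter does on your machine (a debugging sketch, not part of the code in this thread) is to print the device of the input chunk inside forward:

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    """Tiny throwaway module that just reports where its input chunk landed."""
    def forward(self, v):
        print(f"chunk of shape {tuple(v.shape)} on {v.device}")
        return v * 2

model = nn.DataParallel(Probe().cuda())
out = model(torch.randn(8, 3).cuda())
# On a 2-GPU machine this typically prints something like:
#   chunk of shape (4, 3) on cuda:0
#   chunk of shape (4, 3) on cuda:1
```

For the speed part of question 3, watching nvidia-smi while training, or profiling a few iterations with torch.profiler, should expose an accidental GPU-to-CPU round trip.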