Error when using .cuda() in "forward" with DataParallel: arguments are located on different GPUs

The following is part of my code.
It works well on a single GPU, but I need to use multiple GPUs, and calling .cuda() inside “forward” fails under DataParallel. I also tried ‘epsilon = self.normal.sample(self.mu.size()).cuda(self.mu.device)’ (the tensor still is not sent to the right GPU) and register_buffer (which only works for tensors created in __init__). I would really appreciate your help!

class Gaussian(object):
    def __init__(self, mu, rho):
        super().__init__()
        self.mu = mu
        self.rho = rho
        self.normal = torch.distributions.Normal(0, 1)

    @property
    def sigma(self):
        return torch.log1p(torch.exp(self.rho))

    def sample(self):
        epsilon = self.normal.sample(self.mu.size()).cuda()   # This is where the error happens !
        return self.mu + self.sigma * epsilon

class SharableLinear(nn.Module):
    """Modified linear layer."""
    __constants__ = ['bias', 'in_features', 'out_features']

    def __init__(self, in_features, out_features, bias=True, ratio=0.5):
        super(SharableLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features

        # weight and bias are registered as Parameters; weight serves as mu of the Gaussian.
        self.weight = Parameter(torch.Tensor(out_features, in_features), requires_grad=True)
        nn.init.normal_(self.weight, 0, 0.01)
        if bias:
            self.bias = Parameter(torch.Tensor(out_features), requires_grad=True)
            nn.init.constant_(self.bias, 0)
        else:
            self.register_parameter('bias', None)

        fan_in, _ = _calculate_fan_in_and_fan_out(self.weight)

        total_var = 2 / fan_in
        noise_var = total_var * ratio
        mu_var = total_var - noise_var

        noise_std, mu_std = math.sqrt(noise_var), math.sqrt(mu_var)
        rho_init = np.log(np.exp(noise_std) - 1)

        self.weight_rho = nn.Parameter(torch.Tensor(out_features, 1).uniform_(rho_init, rho_init))

        self.weight_gaussian = Gaussian(self.weight, self.weight_rho)

    def forward(self, input, sample=False):
        if sample:
            weight = self.weight_gaussian.sample()   # weight is re-sampled inside forward, so the noise has to be moved to the GPU here
        else:
            weight = self.weight

        return F.linear(input, weight, self.bias)

Hi,

DataParallel replicates your model onto multiple GPUs and splits each input batch across them, so different copies of your model are located on different GPUs.
A bare .cuda() places the Tensor on the current CUDA device (GPU 0 unless it has been changed), so a copy that lives on another GPU can end up with a Tensor on the wrong GPU.
You can replace it with .to(self.mu.device) to make sure the new Tensor is always placed on the same device as the other Tensors for that copy.
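
As a minimal sketch of that suggestion (the tensor mu below is only a stand-in for self.mu, and the variable names are illustrative), the noise can either be moved to mu's device after sampling or created there directly with torch.randn_like:

```python
import torch

# Stand-in for self.mu; in the real model this would be the layer's weight
# Parameter, living on whichever device its copy was placed on.
mu = torch.zeros(3, 2)
normal = torch.distributions.Normal(0, 1)

# Option 1: sample on the CPU, then move the result to mu's device.
epsilon = normal.sample(mu.size()).to(mu.device)

# Option 2: draw standard-normal noise directly with mu's shape, dtype and device.
epsilon_direct = torch.randn_like(mu)

print(epsilon.device == mu.device, epsilon_direct.device == mu.device)
```

Both options only help if mu itself is on the GPU of the current copy, so it is worth printing mu.device inside forward to check.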

Hi,
Many thanks for your reply!
When I changed .cuda() to .cuda(self.mu.device) or .to(self.mu.device), it still raises RuntimeError: arguments are located on different GPUs.

        epsilon = self.normal.sample(self.mu.size()).to(self.mu.device)

Here are some details.

  File "/home/bzg/anaconda3/envs/torch1.2/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.



    return F.linear(input, weight, self.bias)
  File "/home/bzg/anaconda3/envs/torch1.2/lib/python3.7/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/generic/THCTensorMathBlas.cu:260
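
One possible explanation, which is an assumption and not confirmed in this thread: Gaussian is a plain Python object rather than an nn.Module, so DataParallel does not replicate the tensors it holds, and every copy of the layer may still reach the original mu and rho on GPU 0 through self.weight_gaussian. That would match the traceback, where the sampled weight and the input disagree inside F.linear. A sketch that avoids stored references by passing the parameters in at call time (sample_weight and SampledLinear are hypothetical names, and the layer is a simplified stand-in for SharableLinear):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sample_weight(mu, rho):
    """Return mu + sigma * epsilon, with epsilon created on mu's device."""
    sigma = torch.log1p(torch.exp(rho))   # same softplus as Gaussian.sigma
    epsilon = torch.randn_like(mu)        # standard normal, same device/dtype as mu
    return mu + sigma * epsilon


class SampledLinear(nn.Module):
    """Hypothetical, simplified stand-in for SharableLinear."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, 0, 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Illustrative initial value; the original code derives it from fan_in.
        self.weight_rho = nn.Parameter(torch.full((out_features, 1), -3.0))

    def forward(self, input, sample=False):
        # The parameters are looked up on the module that is running forward,
        # not on a helper object created once in __init__.
        if sample:
            weight = sample_weight(self.weight, self.weight_rho)
        else:
            weight = self.weight
        return F.linear(input, weight, self.bias)


layer = SampledLinear(4, 3)
out = layer(torch.randn(2, 4), sample=True)
print(out.shape)
```

This runs on the CPU as written; wrapping the layer in nn.DataParallel and moving it to the GPU is left out so the sketch stays self-contained.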

My closest attempt is to make Gaussian inherit from nn.Module and rename sample to forward():

class SharableLinear(nn.Module):
    def forward(self, input, sample=False):
        if sample:
            weight = self.weight_gaussian.forward()
        else:
            weight = self.weight

class Gaussian(nn.Module):
    def __init__(self, mu, rho):
        super().__init__()
        self.mu = mu
        self.rho = rho
        self.normal = torch.distributions.Normal(0, 1)

    @property
    def sigma(self):
        return torch.log1p(torch.exp(self.rho))

    def forward(self):
        epsilon = self.normal.sample(self.mu.size()).cuda()
        return self.mu + 0.1 * self.sigma * epsilon

This time there is no error and the code runs, but a warning remains:


I suspect this still has a bad effect. Do you have any ideas?

If the warning were about the losses generated on different GPUs, I could simply call loss.mean(). But I have no idea how to handle this problem.

The warning seems to say that your forward returns a scalar, which cannot be concatenated directly, so each one is made into a 1D Tensor with 1 element and these are then concatenated.
This is fine.

Thank you for your reply! The existence of this warning still worries me, maybe I’ll just have to postpone that.

You can call .view(1) or .unsqueeze(0) on the value returned from forward to get something that is 1D and silence the warning.
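
A small sketch of that reshaping, assuming forward returns a 0-dimensional loss (the values and variable names are illustrative). For a 0-dimensional Tensor the only valid index for unsqueeze is 0, and torch.cat here only imitates the gathering of one result per GPU:

```python
import torch

# Stand-ins for the scalar results that two copies of the model might return.
loss_gpu0 = torch.tensor(0.5)
loss_gpu1 = torch.tensor(1.5)
print(loss_gpu0.dim())   # 0-dimensional, so it cannot be concatenated as is

# Turn each scalar into a 1D Tensor with one element before returning it.
loss_gpu0_1d = loss_gpu0.view(1)
loss_gpu1_1d = loss_gpu1.unsqueeze(0)

# Imitate the gather step, then reduce to a single value.
gathered = torch.cat([loss_gpu0_1d, loss_gpu1_1d])
mean_loss = gathered.mean()
print(gathered.shape, mean_loss.item())
```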

Thanks again, but I need to return a 2D tensor, since self.mu is 2D and sigma is 1D. Besides, using self.sigma.expand(self.mu.size()) or sigma.unsqueeze(1) still does not fix the warning. I don’t even know what this warning refers to. The changed code still works well on a single GPU, so the problem must be in DataParallel; maybe I should learn more about its mechanism first.
The only information is