Tensors are on different GPUs Runtime Error

I am trying to use multiple GPUs for training and keep getting "RuntimeError: tensors are on different GPUs". I have searched this forum for solutions, but none of them work.

Could you post a small (dummy) code snippet reproducing your error please?

Hi, thanks for the quick reply! I had a quick question: why is it not advised to use multiple GPUs via multiprocessing? Towards the end of http://pytorch.org/docs/master/notes/cuda.html it advises "Use nn.DataParallel instead of multiprocessing".

Yet there is an example of using multiple GPUs with multiprocessing at http://pytorch.org/tutorials/intermediate/dist_tuto.html. Isn't that contradictory?

My concern is that I have implemented a neural network that does not follow the usual forward and backward pass. I am trying to come up with a much simpler sample code and will post it very soon.

As far as I understand it, using multiprocessing on a single machine is not advised unless you have a specific use case and know what you are doing. DataParallel should be sufficient for most use cases.

However, for distributed systems multiprocessing is necessary and the torch.distributed package handles some of the management.
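To make that concrete, here is a minimal sketch of the single-machine DataParallel usage. The tiny linear model, loss, optimizer and random data are just placeholders so the snippet runs on its own; substitute your own model and data loading.

import torch
import torch.nn as nn
import torch.optim as optim

# a tiny stand-in model; any nn.Module that fits on a single GPU works the same way
model = nn.Linear(10, 5)
model = nn.DataParallel(model).cuda()   # replicate the model across all visible GPUs

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    data = torch.randn(32, 10).cuda()    # inputs go to the default GPU;
    target = torch.randn(32, 5).cuda()   # DataParallel scatters the batch across devices
    optimizer.zero_grad()
    output = model(data)                 # outputs are gathered back on the default GPU
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()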

Hi, I am using 8 GPUs. Would that count as a single machine or multiple machines? (All 8 GPUs are in a single machine.)

In this case I would use DataParallel, since it’s a single machine.

I am presently using the multi-GPU code built around the dist.all_reduce() function, as in http://pytorch.org/tutorials/intermediate/dist_tuto.html (the reason being that DataParallel raises the "tensors are on different GPUs" error, and I have spent a lot of time trying to fix it, but it still doesn't work). Do you suggest that dist.all_reduce is worse than nn.DataParallel for 8 GPUs?
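For context, the tutorial-style setup I am using spawns one process per GPU, roughly like this (the backend, address and port are just what I picked for my machine, and the training body is elided):

import os
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, world_size):
    # one process per GPU; everything in here is pinned to GPU `rank`
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    # ... build the model with .cuda(rank), train, and use dist.all_reduce() to combine values ...

if __name__ == '__main__':
    world_size = 8
    processes = []
    for rank in range(world_size):
        p = Process(target=run, args=(rank, world_size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()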

Hi, this is the simplest example of what I am trying to do with nn.DataParallel (this code doesn't yet wrap net and other_net in DataParallel, but it gives an idea of the logic I am trying out):

import torch
import torch.nn as nn

class OtherNet(nn.Module):
    def __init__(self):
        super(OtherNet, self).__init__()
        self.layer1_other = nn.Linear(10, 5)

    def forward(self, x):
        return self.layer1_other(x)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = nn.Linear(10, 5)

    def forward(self, x, other_net):
        y = some_function_of_other_net(x)  # placeholder for a computation involving other_net
        return self.layer1(y)


def main():
    net = Net()
    other_net = OtherNet()
    scalar = 0

    # STEP 1: accumulate a scalar quantity over the whole dataset
    for data, target in dataset:
        output = net(data, other_net)
        loss = criterion(output, target)
        scalar = scalar + loss

    # got the new scalar, now run backprop on net
    # STEP 2: backpropagate to update net
    for data, target in dataset:
        output = net(data, other_net)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

    # now use scalar to update other_net
    # STEP 3: modify other_net's parameters using scalar and the loss
    for data, target in dataset:
        output = net(data, other_net)
        loss = criterion(output, target)
        other_net.param += some_function(scalar, loss)  # placeholder update

Well, if your models fit on one GPU, DataParallel should work.
Your code seems to be a bit strange (net output is loss), but since this seems to be just an example, I think I get the idea.

Could you create a code example reproducing the initial error?

Hi Ptr,
I have updated the code. Let's assume that I am only interested in performing the following 3 steps (I agree it is a bit of an unusual NN).
Below I will summarize the key aspects of what I am trying to do:

Step 1: calculate a scalar quantity, called scalar, over the whole dataset.
Step 2: backpropagate to update net.
Step 3: modify the parameters of other_net based on some function of scalar and the loss.

Now my problem is that each GPU computes a different value of scalar, and I want to average this scalar over all 8 GPUs. The code doesn't work and returns the "tensors are on different GPUs" error. So instead I explicitly start 8 processes, each NN has a self.rank = rank (the rank of its process), and every variable etc. gets .cuda(self.rank) appended. However, the GPU utilization becomes very bad.
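To be precise, the averaging step I want inside each of the 8 processes is essentially this (assuming the process group is already initialized and scalar is a CUDA tensor on that process's GPU):

import torch.distributed as dist

# every process holds its own value of `scalar` on its own GPU;
# sum them across all processes and divide by the number of processes
dist.all_reduce(scalar)            # defaults to a SUM reduction
scalar /= dist.get_world_size()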

Would it be possible to use DataParallel on just part of your whole model?
In the ImageNet example, only the features are parallelized for AlexNet and VGG.
That way you could calculate the scalar value once, without having to average it across the GPUs.
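In code, that part of the ImageNet example boils down to something like this (model being the AlexNet or VGG model from the example):

import torch.nn as nn

# only the feature extractor is replicated across the GPUs;
# the classifier (and any extra computation such as your scalar)
# runs once on the default GPU
model.features = nn.DataParallel(model.features)
model.cuda()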
What do you think?

Hi, thanks for replying. If that is possible (DataParallel on part of the model), then your suggestion is to calculate the scalar using a single GPU? I will go through the ImageNet example now to see what is being done there.

Actually, when I print x.get_device() and y.get_device(), where x and y are the tensors involved, they show the same device. Even then I get the "tensors are on different GPUs" error (using nn.DataParallel). This happens even for a vanilla ResNet with some minor modifications.
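For reference, the check I am running right before the failing call is essentially:

# x and y are the two tensors involved in the operation that raises the error
print(x.get_device(), y.get_device())   # both show the same device index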