torch.nn.DataParallel problem with new server

Hi,

I have a program that leverages torch.nn.DataParallel to run on multiple GPUs. I tested it on a system with 3 GPUs (1080 Ti) using pytorch==1.2 and cuda==10.0. Everything works perfectly: the program runs and uses all 3 GPUs.

Now I’m trying to run it on a new server with 3 GPUs (2080 Ti) and the same PyTorch and CUDA configuration, and I get the following error:

File "/nfs/brm/main.py", line 384, in <module>
    train_loss = model.fit(interactions=ds_train, verbose=True)
  File "/nfs/brm/implicit.py", line 255, in fit
    positive_prediction = self._net(batch_user, batch_item)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

The error seems clear: some part of the model or its inputs is on another GPU. But that shouldn’t be the case, since the same code runs perfectly on the other server. This is how I’m using DataParallel:

        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu")
        self._net.to(self.device) #_net is my model
        self._net = torch.nn.DataParallel(self._net)

I move the model’s inputs to the GPU the same way, with .to(self.device).
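To be concrete, inside fit() each batch is moved to the same device before the forward pass. Roughly (simplified; minibatches() here is just a hypothetical stand-in for my batching code):

    # Simplified from fit(): every minibatch goes to self.device before the forward pass.
    # minibatches() is a hypothetical placeholder for the actual batching logic.
    for batch_user, batch_item in minibatches(interactions):
        batch_user = batch_user.to(self.device)
        batch_item = batch_item.to(self.device)
        positive_prediction = self._net(batch_user, batch_item)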

On the new server the program runs if I request only one GPU, but it fails when I request multiple (e.g. 3 GPUs).

Do you have any idea how to investigate this?

Can you try this instead? It would ensure the parameters are always allocated on device 0:

    self.device = torch.device(
        "cuda:0" if torch.cuda.is_available() else "cpu")

Thanks for your answer. That changes the error to a different one:

File "/nfs/brm/main.py", line 385, in <module>
    train_loss = model.fit(interactions=ds_train, verbose=True)
  File "/nfs/brm/implicit.py", line 255, in fit
    positive_prediction = self._net(batch_user, batch_item)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/brm/representations.py", line 95, in forward
    attention_mask=input_mask)[0]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_distilbert.py", line 592, in forward
    head_mask=head_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_distilbert.py", line 461, in forward
    embedding_output = self.embeddings(input_ids)   # (bs, seq_length, dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_distilbert.py", line 92, in forward
    word_embeddings = self.word_embeddings(input_ids)                   # (bs, max_seq_length, dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1467, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:397

Could you share the code you’re running so that I can try to repro this locally and see what the issue might be?

Unfortunately, the source code depends on several modules and large datasets, so it wouldn’t be very useful for debugging. Also, the code runs perfectly on the other server when I connect via SSH and run the Python script directly.
The new server is based on Kubernetes and OpenShift, and my code is deployed in a Docker container. I suspect this is what causes DataParallel to misidentify the GPUs.
Have you seen anything related to this?
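If it helps, this is the kind of check I can run inside the container to see what PyTorch actually detects:

    import os
    import torch

    # What the container exposes vs. what PyTorch sees
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))
    print(torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))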

Are the inputs you feed to the model on the same device (cuda:0) when you run the training loop?
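An easy way to check is to print the devices right before the forward call, e.g.:

    # Right before the forward pass: the inputs and the master copy of the parameters
    # should all report cuda:0 (DataParallel keeps the original model in .module)
    print(batch_user.device, batch_item.device)
    print(next(self._net.module.parameters()).device)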

Also, would it be possible for you to come up with a small example that reproduces the problem you’re seeing? It would be easier to debug the issue that way.
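Even a toy model would do, something along these lines with a small embedding layer standing in for your real network (just a sketch of what a repro could look like, not your actual code):

    import torch
    import torch.nn as nn

    # Minimal DataParallel sketch: a toy embedding model in place of the real one
    class ToyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(1000, 32)
            self.fc = nn.Linear(32, 1)

        def forward(self, x):
            return self.fc(self.emb(x)).squeeze(-1)

    device = torch.device("cuda:0")
    net = nn.DataParallel(ToyNet().to(device))

    # The batch dimension gets scattered across the visible GPUs
    x = torch.randint(0, 1000, (64,), device=device)
    print(net(x).shape)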

It was a bug in CUDA 10.0; just upgrading to CUDA 10.1 solved the problem.