Hi,
I have a program that leverages torch.nn.DataParallel to run on multiple GPUs. I tested it on a system with 3 GPUs (1080 Ti) using pytorch==1.2 and cuda==10.0. Everything works perfectly: the program runs and uses all 3 GPUs.
Now I'm trying to run it on a new server with 3 GPUs (2080 Ti) and the same PyTorch/CUDA configuration, but I get the following error:
File "/nfs/brm/main.py", line 384, in <module>
train_loss = model.fit(interactions=ds_train, verbose=True)
File "/nfs/brm/implicit.py", line 255, in fit
positive_prediction = self._net(batch_user, batch_item)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
"them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
The error is clear: it seems that part of the model or the inputs is on another GPU. But that shouldn't be the case, since the same code runs perfectly on the other server. This is how I'm using DataParallel:
self.device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu")
self._net.to(self.device)  # _net is my model
self._net = torch.nn.DataParallel(self._net)
I move the model's inputs onto the GPU the same way (.to(self.device)).
On the new server the program runs if I ask for only one GPU, but it fails when I ask for multiple (e.g. 3 GPUs).
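For reference, here is a minimal self-contained version of my setup, with a toy nn.Linear standing in for my real _net (the toy module and tensor shapes are assumptions for illustration only):

```python
import torch
import torch.nn as nn

# Toy stand-in for my real _net (assumption for illustration)
net = nn.Linear(16, 4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)  # move parameters/buffers first
# Wrap for multi-GPU; device_ids defaults to all visible GPUs,
# and device_ids[0] is treated as the primary device
net = nn.DataParallel(net)

# Inputs are moved the same way as the model
x = torch.randn(8, 16).to(device)
out = net(x)
print(out.shape)  # torch.Size([8, 4])
```

(On a CPU-only machine DataParallel just falls back to a plain forward pass, so this sketch runs either way.)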
Do you have any idea how to investigate the problem?
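In case it helps, here is a quick diagnostic sketch I could run to see which devices the model's parameters and buffers actually live on (the report_devices helper and the toy Sequential module are my own illustrations, not part of my real code). With the real model, more than one device in the resulting set would explain the DataParallel error:

```python
import torch
import torch.nn as nn

def report_devices(module):
    """Return the set of devices holding this module's parameters and buffers."""
    devices = {p.device for p in module.parameters()}
    devices |= {b.device for b in module.buffers()}
    return devices

# Toy stand-in for the real model (assumption for illustration);
# BatchNorm1d contributes buffers (running_mean/running_var) as well
net = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
print(report_devices(net))  # e.g. {device(type='cpu')}
```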