I am currently trying to train an NLP model on multiple GPUs (4) with two inputs: the text encoded as a long tensor and the lengths of each item in the batch.
Unfortunately I get an error I do not really understand. I followed the PyTorch tutorial, so I am wondering what I am doing wrong. My code works fine without DataParallel. For information, I am also using apex.
The piece of code:

if cfg.multigpu:
    model = nn.DataParallel(model)
device = "cuda:0"
....
scores = model(data["Text"].to(device), data["Lengths"].to(device))
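For comparison, here is a minimal sketch of the ordering the PyTorch tutorial uses: move the module to the primary device first, then wrap it in DataParallel (and, when using apex, call amp.initialize before wrapping). The TextModel below is hypothetical, just a stand-in for a model taking the same two inputs (encoded text and lengths); it is not my actual model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a model with two inputs:
# encoded text (LongTensor) and per-item lengths.
class TextModel(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim, 2)

    def forward(self, text, lengths):
        emb = self.emb(text)                          # (batch, seq, emb_dim)
        pooled = emb.sum(dim=1) / lengths.unsqueeze(1).float()  # mean-pool
        return self.fc(pooled)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = TextModel().to(device)          # move to the primary device first
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # then wrap; replicas are made per forward

text = torch.randint(0, 100, (8, 12))               # batch of 8, seq len 12
lengths = torch.full((8,), 12, dtype=torch.long)
scores = model(text.to(device), lengths.to(device))
print(scores.shape)                                  # torch.Size([8, 2])
```

DataParallel scatters each tensor argument along dim 0, so both the text and the lengths tensors get split across the replicas automatically.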
The error:
File "/home/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
File "/home/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 166, in replicate
    setattr(replica, key, buffer_copies[j][buffer_idx])
IndexError: list index out of range
I tried setting the "device_ids" argument of DataParallel explicitly, but I got the same issue.
Am I forgetting to add some lines of code to make DataParallel work?