I tried to use it as a drop-in replacement, so the same setup as in the first post:
device = torch.device('cuda:4')
model = nn.DataParallel(model, device_ids=[4,5])
As you suggested, I removed the to(device) calls from inside the forward function, but then I ran into this error:
RuntimeError: module must have its parameters and buffers on device cuda:4 (device_ids[0]) but found one of them on device: cpu
After some research, it seems you have to call to(device) on both the model and the inputs before calling forward, which moves them to the primary GPU (device_ids[0]), from where they are scattered to the other GPUs. At least, that’s how I understand it right now. With the model and the input tensors on the same device, that error message goes away:
model = nn.DataParallel(model, device_ids=[4,5]).to(device)
model([[x.to(device) for x in y] for y in inputs], lengths.to(device), g=False)
This lets me reach the point in my forward function where I dynamically create a padding mask (this line previously had a to(device) call and now doesn’t):
mask = torch.arange(max_len).expand(len(lengths), max_len) < lengths.unsqueeze(1)
But that line raises this error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and cpu!
So I’m not sure whether I’m totally on the wrong track now, but the last error at least suggests that I need some kind of to(device) inside forward. Of course, it would have to be the device of the current replica rather than a fixed one.
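
One way this might work, as a minimal sketch rather than a confirmed fix: instead of a fixed device, create the mask on whatever device the incoming lengths tensor already lives on. Since DataParallel scatters the inputs to each replica's GPU, lengths.device should be the right device inside each forward call. The helper name make_padding_mask is hypothetical and only there for illustration; the example runs on CPU so it can be tried without GPUs.

```python
import torch


def make_padding_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    # Hypothetical helper: build the position indices on the same device as
    # `lengths`, so no fixed device is hard-coded inside forward.
    positions = torch.arange(max_len, device=lengths.device)
    # Result has shape (batch, max_len); True marks positions inside a sequence.
    return positions.expand(len(lengths), max_len) < lengths.unsqueeze(1)


# Small CPU example with three sequences of lengths 3, 1 and 2.
lengths = torch.tensor([3, 1, 2])
mask = make_padding_mask(lengths, max_len=4)
```

Inside forward, the same idea would be a one-line change to the mask line: pass device=lengths.device to torch.arange.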