I have the following DataParallel code:
model = Model(args)
model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()
for data in train_data:
    data = data_to_cuda(data)
    predicted_output = model(data)
    loss = compute_loss(predicted_output, data['labels'])
Now I am getting the following error:
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
I traced the error to the forward call of one of the model's submodules, which is defined as follows:
class ImageEncoder:
    def __init__(self, args):
        resnet_model = torchvision.models.resnet18()
        self.model = torch.nn.Sequential(*(list(resnet_model.children())[:-1]))

    def forward(self, x):
        x = x.float()
        output = self.model(x)
        return output
After entering the forward call of ImageEncoder, I printed the following:
def forward(self, x):
    print('x', x.device)
    x = x.float()
    print('x.float', x.device)
    print('model', self.model[4][0].conv1.weight.device)
    output = self.model(x).squeeze(-1).squeeze(-1)
    print('output', output.device)
    return output
I got the following output:
x cuda:0
x cuda:1
x.float cuda:1
model cuda:0
x.float cuda:0
model cuda:0
output cuda:0
It seems that ImageEncoder is only copied to one device. Can someone please explain what is wrong with my code?
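One check that may help narrow this down is whether the encoder's weights are registered with the parent module at all, since nn.DataParallel replicates only registered parameters and submodules. The sketch below is a minimal CPU-only illustration, not the actual model from this post: PlainEncoder, RegisteredEncoder, Wrapper and registered_parameter_names are hypothetical names introduced here, and whether this is the cause of the error above is an assumption rather than something the post confirms.

```python
import torch.nn as nn


class PlainEncoder:
    """Hypothetical encoder that does NOT subclass nn.Module."""

    def __init__(self):
        self.model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3))


class RegisteredEncoder(nn.Module):
    """Hypothetical encoder that subclasses nn.Module and calls super().__init__()."""

    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3))

    def forward(self, x):
        return self.model(x.float())


class Wrapper(nn.Module):
    """Hypothetical parent model that stores an encoder as an attribute."""

    def __init__(self, encoder):
        super().__init__()
        # Registered as a submodule only if `encoder` is an nn.Module.
        self.encoder = encoder


def registered_parameter_names(module):
    """Return the names of all parameters the module knows about."""
    return [name for name, _ in module.named_parameters()]


plain = registered_parameter_names(Wrapper(PlainEncoder()))
registered = registered_parameter_names(Wrapper(RegisteredEncoder()))
print("plain encoder:", plain)
print("registered encoder:", registered)
```

If the parameter names of a submodule are missing from named_parameters() on the top-level model, that submodule is invisible to .cuda() and to DataParallel's replication, so its weights stay wherever they were created. The same applies to modules kept in a plain Python list or dict instead of nn.ModuleList or nn.ModuleDict.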