Hi. I found that training a module with or without DataParallel returns a different module. The training process is simple:
model = Model(input_size, output_size)
model = model.cuda()
if num_gpus == 1:
    pass  # no DataParallel
elif num_gpus == 2:
    model = nn.DataParallel(model, device_ids=[0, 1])
elif num_gpus == 3:
    model = nn.DataParallel(model, device_ids=[0, 1, 2])

for data in rand_loader:
    input = data.to('cuda')
    output = model(input)
    loss = compute_loss(output)
    loss.backward()
The resulting modules differ between training on one GPU (no DataParallel) and training with different sizes of device_ids; that is, the result with 2 GPUs also differs from the result with 3 GPUs. What should I do if I want exactly the same module in all three scenarios? Any help would be really appreciated!
If the model is in eval mode, the results are identical. But the issue is that I have to train the model using DataParallel. I think the difference comes from dropout.
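For reference, here is a minimal CPU-only sketch (not my actual model) illustrating why I suspect dropout: in train mode, dropout draws a fresh random mask on every forward pass, so repeated calls on the same input almost surely disagree, while in eval mode dropout is the identity and the outputs match exactly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Train mode: each call samples a new random mask and rescales
# surviving elements by 1/(1-p), so two calls almost surely differ.
drop.train()
a = drop(x)
b = drop(x)

# Eval mode: dropout is a no-op, so the output equals the input
# and repeated calls are deterministic.
drop.eval()
c = drop(x)
d = drop(x)
```

Under DataParallel each replica runs on its own device with its own RNG state, so I would expect the dropout masks (and hence the gradients) to depend on how the batch is split across GPUs.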