Hi. I've run into a situation where training a model with or without DataParallel produces a different trained model.
The training process is simple:

model = Model(input_size, output_size)
model = model.cuda()
# single-GPU case: no wrapping
if num_gpus == 2:
    model = nn.DataParallel(model, device_ids=[0, 1])
elif num_gpus == 3:
    model = nn.DataParallel(model, device_ids=[0, 1, 2])
for data in rand_loader:
    input = data.to('cuda')
    output = model(input)
    loss = compute_loss(output)
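To make this reproducible end to end, here is a self-contained sketch of the setup above. Note that `Model`, `rand_loader`, and the loss are toy stand-ins I made up (a small linear model, random tensors, and a mean-square loss); my real code differs, but the structure is the same:

```python
import torch
import torch.nn as nn

# Toy stand-in for my real Model: a single linear layer.
class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.fc(x)

torch.manual_seed(0)
input_size, output_size, batch_size = 5, 2, 30
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Stand-in for rand_loader: a list of random batches.
rand_loader = [torch.randn(batch_size, input_size) for _ in range(10)]

model = Model(input_size, output_size).to(device)
if torch.cuda.device_count() >= 2:
    # In my runs this is device_ids=[0, 1] or [0, 1, 2].
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for data in rand_loader:
    input = data.to(device)
    output = model(input)
    loss = output.pow(2).mean()  # placeholder for my compute_loss()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```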
The results differ between training on one GPU (no DataParallel) and training with DataParallel, and they also differ between different sizes of device_ids: the model trained with 2 GPUs is not the same as the one trained with 3.
What should I do if I want exactly the same model in all three scenarios? Any help would be really appreciated!
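Is a seeding setup like the following (my understanding of the usual determinism recipe; `set_seed` is just a helper name I made up) expected to be enough, or is there something extra going on with how DataParallel splits the batch?

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    # Seed every RNG that a typical PyTorch training run touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices
    # Ask cuDNN for deterministic kernels instead of autotuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)
```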