How to train the same model with and without DataParallel

Hi. I found that training a module with and without DataParallel returns a different module.

The training process is simple:

model = Model(input_size, output_size)
model = model.cuda()

# num_gpus is a placeholder for the three scenarios (1 = no DataParallel)
if num_gpus == 2:
    model = nn.DataParallel(model, device_ids=[0, 1])
elif num_gpus == 3:
    model = nn.DataParallel(model, device_ids=[0, 1, 2])

for data in rand_loader:
    input = data.to('cuda')
    output = model(input)

    loss = compute_loss()

The results differ depending on whether I train on one GPU (no DataParallel) or with DataParallel, and they also differ with the number of entries in device_ids, i.e. 2 GPUs and 3 GPUs give different results too.

What should I do if I want exactly the same module in all three scenarios? Any help would be really appreciated!

Hey @ronzhou, could you please clarify what you mean by “results are different”? Are there huge gaps in the loss curve?

You can reproduce the problem with the following code.

import torch
from torch import nn
layer = nn.TransformerEncoderLayer(d_model=20, nhead=5, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=5).to('cuda:2')
model_parallel = nn.DataParallel(model, device_ids=[2,3])
model_parallel2 = nn.DataParallel(model, device_ids=[2,5,6])
input_ = torch.randn(size=(31, 14, 20)).to('cuda:2')
model_result = model(input_)
model_parallel_result = model_parallel(input_)
model_parallel_result2 = model_parallel2(input_)
abs_max = lambda a,b : f"{(a - b).abs().max().item():.3f}"
print(abs_max(model_parallel_result, model_result)) #1.576
print(abs_max(model_parallel_result, model_parallel_result2)) #1.416

If the model is in eval mode, the results are identical. The problem is that I have to train the model with DataParallel, and I think the difference comes from dropout.
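
The dropout hypothesis can be checked without any GPUs. Below is a minimal CPU-only sketch, assuming a single nn.TransformerEncoderLayer is representative of the model above. The names train_differs, eval_matches and nodrop_matches are just illustrative, and dropout=0.0 is a debugging setting, not a recommendation for real training.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(31, 14, 20)  # same shape as the input in the repro above

# Default dropout (p=0.1): each forward pass in train mode draws a new mask,
# so two passes over the same input are expected to disagree.
layer = nn.TransformerEncoderLayer(d_model=20, nhead=5, batch_first=True)
layer.train()
train_differs = not torch.allclose(layer(x), layer(x))

# Eval mode disables dropout, so repeated passes should agree.
layer.eval()
with torch.no_grad():
    eval_matches = torch.allclose(layer(x), layer(x))

# With dropout=0.0 the randomness is gone even in train mode.
layer_nodrop = nn.TransformerEncoderLayer(
    d_model=20, nhead=5, dropout=0.0, batch_first=True
)
layer_nodrop.train()
nodrop_matches = torch.allclose(layer_nodrop(x), layer_nodrop(x))

print(train_differs, eval_matches, nodrop_matches)
```

If the same pattern shows up on GPU, then comparing the plain model against the DataParallel wrappers in train mode with dropout=0.0 should tell whether dropout is the only source of the gap. Even a single-GPU model does not reproduce its own output from one call to the next while dropout is active, so some mismatch in train mode is probably expected rather than a DataParallel bug.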