Hi, I noticed that when I train on the same dataset with DDP on 8 GPUs versus a single GPU, the loss plots are very different (the DDP loss is higher), and it seems to take more epochs for the DDP loss to decrease to the single-GPU level. My questions are:
Is the single-GPU loss directly comparable to the DDP loss? If they are different quantities, does that mean that if we train with 2, 4, and 16 GPUs and get three losses, we cannot compare them?
If we train both the DDP setup and the single-GPU setup for 10 epochs, can we really say the two resulting models are almost the same? Or is the single-GPU model better because it reaches a lower loss?
Since you didn’t mention it in your post, how are you adjusting the batch size and learning rate when scaling to more GPUs? Note that you would want to increase the learning rate when the number of GPUs increases, even if the per-GPU batch size is the same in both setups, because the effective global batch size is the DataLoader batch size times the number of GPUs.
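For concreteness, here is a minimal sketch of that linear scaling rule; the base learning rate, batch size, and model below are placeholders rather than anything from your setup:

```python
import torch
import torch.distributed as dist

# Placeholder values -- substitute your own single-GPU hyperparameters.
base_lr = 0.1            # learning rate tuned for the single-GPU run
per_gpu_batch_size = 64  # DataLoader batch size on each rank

world_size = dist.get_world_size() if dist.is_initialized() else 1
global_batch_size = per_gpu_batch_size * world_size  # effective batch per optimizer step

# Linear scaling rule: grow the learning rate in proportion to the global batch size.
scaled_lr = base_lr * world_size

model = torch.nn.Linear(128, 10)  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
```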
As for your second point, it is well known that sample efficiency shows diminishing returns as the batch size is scaled up, so the number of required epochs may not be exactly the same between setups with different numbers of GPUs. Naively, the model with the lower loss would be the better one, and you can consult the literature, since model quality vs. batch size and sample efficiency vs. batch size are heavily studied topics.
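One more thing to rule out before comparing curves: with DDP, each rank typically logs the loss on its own shard of the global batch, so the plotted curve can be a noisier per-rank quantity rather than the global average that the single-GPU run reports. A small sketch of how you could log a directly comparable number (the helper name is just mine):

```python
import torch
import torch.distributed as dist

def global_mean_loss(local_loss: torch.Tensor) -> torch.Tensor:
    """Average the per-rank loss over all DDP processes so the logged value
    matches what a single-GPU run over the full global batch would report."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_loss  # single-process run: nothing to reduce
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return loss / dist.get_world_size()
```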
I have adjusted the batch size so that the multi-GPU global batch size equals the single-GPU batch size. In this case, do I still need to adjust the learning rate to get the same performance from the two models? My first intuition is that their loss plots should be the same, but they are different; the multi-GPU run seems to converge more slowly.
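Roughly what I am doing, simplified and with placeholder numbers, since the DataLoader batch size under DDP is per process:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

single_gpu_batch_size = 256  # placeholder for my original single-GPU batch size
world_size = dist.get_world_size() if dist.is_initialized() else 1

# The DataLoader batch_size is per process under DDP, so divide it by the
# number of ranks to keep the global batch size equal to the single-GPU one.
per_gpu_batch_size = single_gpu_batch_size // world_size

dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
sampler = DistributedSampler(dataset) if dist.is_initialized() else None
loader = DataLoader(dataset, batch_size=per_gpu_batch_size,
                    sampler=sampler, shuffle=(sampler is None))
```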
That is why I asked the second question: even though single-GPU and multi-GPU training use the same global batch size, their losses follow different trends. Can we compare them and say the lower loss is better, or can we simply not compare them because they have different training properties?
Perhaps I should not be chiming in on this since I’m relatively new to the whole multi-GPU training paradigm, but:
I have adjusted the batch size so that the multi-GPU global batch size equals the single-GPU batch size.
Hopefully someone will correct me if I’m wrong here, but isn’t the reason behind the speedups of multi-GPU training purely that it enables much larger global batch sizes, allowing us to crunch through large datasets in fewer optimization steps?
Unless, of course, you set the multi-GPU global batch size equal to the single-GPU batch size only to conduct tests on the loss trend, in which case, ignore me.
In theory, slower convergence would be unexpected at the same global batch size. Are you using normalization layers in your model? Since you are keeping the global batch size the same, I’m wondering whether the smaller per-GPU batch size could be interfering with normalization layers such as BatchNorm, whose statistics are computed per GPU by default.
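If the model does use BatchNorm, one thing worth trying (just a suggestion on my part, since you haven’t said which normalization you use) is converting it to SyncBatchNorm so that the statistics are computed over the whole global batch instead of each GPU’s slice:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in model containing BatchNorm; replace with your actual network.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Convert every BatchNorm layer to SyncBatchNorm so the normalization
# statistics are synchronized across ranks instead of computed per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Assumes the process group has already been initialized and a CUDA device
# has been set for this rank before wrapping the model in DDP.
model = model.cuda()
model = DDP(model, device_ids=[torch.cuda.current_device()])
```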