Accuracy difference on multi GPU with nn.DataParallel

Hi, I am training my classification network on 4 RTX 2080 Ti. I am training resnet-152 on my machine and a single GPU can take a batch size of 32 max. I used:

model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]) 

to run the model on multiple GPUs. From nvidia-smi, it seems that all the GPUs are used, and I can even pass a batch size of 128 [32 * 4], which makes sense. I have code that calculates training accuracy and validation accuracy after each epoch of training.

But my accuracy after each epoch increases much faster on a single GPU than on multiple GPUs. I think the rate of change of accuracy should be similar in both cases. The time it takes to process one epoch decreases by a factor of about 4 [tentative], which makes sense as I am using a batch size of 128 [which is divided across 4 GPUs, 32 per GPU]. But the rate of increase in accuracy after each epoch decreases when using 4 GPUs.

Here is the relevant code for training:

model = models.resnet152(pretrained=True)
input_features = model.fc.in_features
model.fc = nn.Linear(input_features, 100)
#model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)

inside training method I have:

        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            preds = torch.argmax(outputs, dim=1)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

For training on multi-gpu, all I did was:
uncomment the

 model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]) 

line and put batch_size=128 in the train_loader. Do I have to do something more?

@apaszke, could you please look into this? I think multiple people have this issue. I am getting 80% training accuracy after 15 epochs using a single GPU, but when I use multiple GPUs [4 in my case], the accuracy after 15 epochs is 28%. I ran my experiments multiple times and the multi-GPU accuracy seems quite bad. Thank you.

DataParallel attempts to replicate single-GPU training on multiple GPUs. So in your setup, a batch size of 128 with the DataParallel module behaves as if you had used a batch size of 128 in a single-GPU setting. If your model is sensitive to batch size, then that is what you’re observing.

Try a batch size of 32.

In addition, while the training procedure should be similar with a constant batch size, the batch norm layers on each GPU will see smaller batches and might yield bad running estimates.
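
To make the batch norm point concrete, here is a minimal sketch of how a batch ends up divided across replicas. It assumes nn.DataParallel splits the input along dimension 0 the way torch.chunk does (chunks of size ceil(batch / num_gpus), with the last one possibly smaller). `replica_batch_sizes` is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math

def replica_batch_sizes(batch_size, num_gpus):
    """Per-GPU chunk sizes, mimicking torch.chunk along dim 0 (assumption)."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

# Total batch sizes from this thread, split over 4 GPUs.
for bs in (4, 32, 128):
    print(bs, replica_batch_sizes(bs, 4))
```

With a total batch of 32 on 4 GPUs, each batch norm layer would compute its statistics over 8 samples per step, and with a total batch of 4 over a single sample.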

Apart from the suggestions here, did you try increasing the learning rate, since you increased the batch size? In your single-GPU case, the number of gradient steps per epoch is 4x that of the multi-GPU case. But a larger batch size might give a better approximation of the gradient. So you can very well increase the learning rate and see if the accuracy improves.
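
One common heuristic for this is the linear scaling rule: multiply the learning rate by the same factor as the batch size. A minimal sketch, where `scale_lr` is a hypothetical helper and the base values (lr 0.0001 at batch size 32) are taken from the first post. Whether linear scaling suits this model is an assumption to verify, not a guarantee.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate linearly with the batch size (a heuristic)."""
    return base_lr * new_batch_size / base_batch_size

base_lr = 0.0001  # learning rate used in the single-GPU run
new_lr = scale_lr(base_lr, base_batch_size=32, new_batch_size=128)
print(new_lr)     # four times the base learning rate
```

The result would then be passed to the optimizer in place of the fixed value, e.g. optim.SGD(model.parameters(), lr=new_lr, momentum=0.9).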


@aauker:
Thank you for the suggestion, please have a look at the results of my few experiments. It seems like nn.DataParallel performs worse than training without it every time.

@ptrblck,
Thank you for the suggestion, but nn.DataParallel seems much worse than its single-GPU counterpart; please see the results of the experiments.

@InnovArul,
Thank you for the suggestion, I will run some experiments and let you know the results.

So, I ran the following experiments for comparison. The training set includes 6k images.
I start from a pretrained resnet18 every time, with a learning rate of 0.0001, and train for 15 epochs. training_accuracy is the final training accuracy after the 15th epoch; time_per_epoch is the time it took to train on one epoch of data.

  1. Exp 1: Without nn.DataParallel, batch_size = 4, time_per_epoch = 0.47 min, training_accuracy = 38.82%.

  2. Exp 2: Without nn.DataParallel, batch_size = 8, time_per_epoch = 0.25 min, training_accuracy = 39.33%.

  3. Exp 3: Without nn.DataParallel, batch_size = 16, time_per_epoch = 0.15 min, training_accuracy = 31.26%.

  4. Exp 4: Without nn.DataParallel, batch_size = 32, time_per_epoch = 0.14 min, training_accuracy = 17.08%.

  5. Exp 5: Without nn.DataParallel, batch_size = 64, time_per_epoch = 0.13 min, training_accuracy = 6.6%.

  6. Exp 6: With nn.DataParallel, batch_size = 4, time_per_epoch = 1.27 min, training_accuracy = 0.31%.

  7. Exp 7: With nn.DataParallel, batch_size = 8, time_per_epoch = 0.65 min, training_accuracy = 9.89%.

  8. Exp 8: With nn.DataParallel, batch_size = 16, time_per_epoch = 0.34 min, training_accuracy = 0.14.33%.

  9. Exp 9: With nn.DataParallel, batch_size = 32, time_per_epoch = 0.19 min, training_accuracy = 11.19%.

  10. Exp 10: With nn.DataParallel, batch_size = 64, time_per_epoch = 0.13 min, training_accuracy = 5.30%.

  11. Exp 11: With nn.DataParallel, batch_size = 128, time_per_epoch = 0.13 min, training_accuracy = 1.73%.

From these experiments, I am not able to see the advantage of nn.DataParallel. It seems to take longer than the single-GPU run at the same batch size, presumably due to data transfer between GPUs. I cannot explain the accuracy difference. Only at a batch_size of 64 are time_per_epoch and accuracy comparable.
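
One thing that differs between these runs, apart from DataParallel, is the number of optimizer updates. A quick calculation, assuming that "6k images" means exactly 6000 and that the last partial batch is kept (the DataLoader default, drop_last=False); `steps_per_epoch` is a hypothetical helper:

```python
import math

def steps_per_epoch(dataset_size, batch_size):
    """Number of optimizer updates in one epoch when the last partial batch is kept."""
    return math.ceil(dataset_size / batch_size)

dataset_size = 6000  # assumption: "6k images" taken as exactly 6000
epochs = 15

for batch_size in (4, 8, 16, 32, 64, 128):
    steps = steps_per_epoch(dataset_size, batch_size)
    print(f"batch_size={batch_size:>3}  steps/epoch={steps:>4}  total={steps * epochs}")
```

Under these assumptions, batch size 4 gives 1500 updates per epoch and batch size 128 only 47, all at the same learning rate of 0.0001, so the larger batches move the weights far less over 15 epochs.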

Hi, I have the same problem.

I trained the same model with the same input, batch size, and number of epochs on 1 GPU and on 8 GPUs. For the parallel run, I use loss.mean().backward().

In contrast, though, my parallel result is better than the single-GPU result by a sizable margin.

I don’t understand which one represents the real performance. Could anyone explain it?

@heroadz,
Could you please share your training code if it’s public? I will give it a shot. Thank you.

Thank you. Check the code below, I have simplified it.

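# Assumed imports (not shown in the original post)
import numpy as np
import torch
from time import time
from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# imagine we create the datasets and `device` here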

# initialize model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
   
# This is the difference! 
if torch.cuda.device_count() > 1:
    # data parallelism
    model = torch.nn.DataParallel(model)
model.to(device)

#initialize optimizer
epochs = 3
optimizer = AdamW(model.parameters(),
      lr = 2e-5, 
      eps = 1e-8 
    )
# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs


# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
    

# training

# Store the average loss after each epoch so we can plot them.
loss_values = []

model.zero_grad()

# For each epoch...
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================


    # Measure how long the training epoch takes.
    t0 = time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Set our model to training mode (as opposed to evaluation mode)
    model.train()


    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):


        # Unpack this training batch from our dataloader. 
        #
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Forward pass (evaluate the model on this training batch)
        # `model` is of type: pytorch_pretrained_bert.modeling.BertForSequenceClassification
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)

        loss = outputs[0]

        total_loss += loss.mean().item()
#       total_loss += loss.item() # this is for single GPU

        # Perform a backward pass to calculate the gradients.
        loss.mean().backward()
#       loss.backward()  # this is for single GPU

        # Clip the norm of the gradients to 1.0.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

        # Clear out the gradients (by default they accumulate)
        model.zero_grad()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            

    loss_values.append(avg_train_loss)
    

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time()

    # Put model in evaluation mode to evaluate loss on the validation set
    model.eval()

    # Tracking variables 
    preds, labels = [], []

    # Evaluate data for one epoch
    for batch in validation_dataloader:

        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)

        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch

        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():        
            # Forward pass, calculate logit predictions
            # token_type_ids is for the segment ids, but we only have a single sentence here.
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        preds.append(np.argmax(logits, axis=1))
        labels.append(label_ids)


    # Report the final accuracy for this validation run.
    preds = [item for sublist in preds for item in sublist]
    labels = [item for sublist in labels for item in sublist]    

    print_score(preds, labels) # my own function for evalution

# ========================================
#               Prediction
# ========================================

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)

    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch

    # Telling the model not to compute or store gradients, saving memory and 
    # speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask)

    logits = outputs[0]

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(np.argmax(logits, axis=1))
    true_labels.append(label_ids)

predictions = [item for sublist in predictions for item in sublist]
true_labels = [item for sublist in true_labels for item in sublist]
print_score(predictions, true_labels, task)

In my opinion, when you change the batch size, the learning rate should be scaled up or down proportionally. Can you try this experiment: batch size 64, without DataParallel, with a learning rate of, say, 0.01?

Hi @InnovArul,
Thank you for the suggestion. So the learning rate seems to be the issue. I did three experiments and here are the results after 15 epochs.

  1. Exp 1: Without nn.DataParallel, learning_rate = 0.01, batch_size = 64. time_per_epoch = 0.13 Min, train_acc = 75.12%.

  2. Exp 2: With nn.DataParallel, learning_rate = 0.01, batch_size = 64, time_per_epoch = 0.13 Min, train_acc = 69.16%.

  3. Exp 3: With nn.DataParallel, learning_rate = 0.01, batch_size = 128, time_per_epoch = 0.13 Min, train_acc = 72.45%.

The accuracies seem comparable now. But with batch_size=128, an epoch takes the same time as with batch_size=64. Also, I think nn.DataParallel loads all the data onto gpu:0 first, which may be the reason my system crashes if I use a batch_size of 256. Do you have any solution for this? Thanks for the reply.

On the topic of time savings with data-parallel, it’s entirely plausible that the real bottleneck with your setup is data augmentation and preparation which takes place on the CPU. I’ve found that ToTensor() on the cpu takes about a second per 1k images. Your epoch takes 8 seconds for 6k.
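
A quick way to check this is to time one pass over the loader without running the model at all. A minimal sketch; `time_loader` is a hypothetical helper, and the list below merely stands in for a real train_loader so that the snippet runs on its own:

```python
import time

def time_loader(loader):
    """Iterate over a loader without training; return (seconds, num_batches)."""
    start = time.perf_counter()
    num_batches = 0
    for _ in loader:
        num_batches += 1
    return time.perf_counter() - start, num_batches

# Stand-in for a real DataLoader (assumption: 94 batches of 64 items).
fake_loader = [list(range(64)) for _ in range(94)]
seconds, num_batches = time_loader(fake_loader)
print(f"{num_batches} batches in {seconds:.3f} s")
```

If this loop alone takes close to the full epoch time on the real loader, the GPUs are waiting on the CPU and adding GPUs will not help much; raising num_workers in the DataLoader is usually the first thing to try.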

@aauker, thank you. Can we do the resize, ToTensor() and normalization on the GPU in parallel?

I did not switch the backward call to loss.mean().backward(). I could still train my model, and the loss decreased smoothly. However, the overall training and test accuracy are slightly worse than for the same model trained without DataParallel. Any clues?

Is switching to loss.mean().backward() a must?
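
Some context on this question: the mean matters when the model computes the loss inside forward (as BertForSequenceClassification does when labels is passed). Each replica then returns its own scalar loss, and nn.DataParallel gathers them into a vector with one entry per GPU. Calling backward() without arguments only works on a scalar, so the vector has to be reduced first. When the loss is computed outside the model from the gathered outputs, as in the ResNet code at the top of the thread, it is already a scalar and .mean() changes nothing. A small CPU-only sketch, where the two-element tensor stands in for losses gathered from two GPUs (made-up values):

```python
import torch

# Stand-in for per-replica losses gathered by nn.DataParallel (made-up values).
w = torch.tensor(1.0, requires_grad=True)
gathered_loss = torch.stack([w * 0.7, w * 0.5])  # shape: (2,)

backward_failed = False
try:
    gathered_loss.backward()      # not a scalar, so this raises
except RuntimeError as err:
    backward_failed = True
    print("backward() on a vector failed:", err)

gathered_loss.mean().backward()   # reduce to a scalar first
print(w.grad)                     # gradient of the averaged loss w.r.t. w
```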