No speed increase when using multiple GPUs

Hello, I am experimenting with using multiple GPUs on my university cluster, but I do not see any speed increase when doing so. I am curious why this is. Below I share some data and code.

Duration of 3 epochs’ worth of training:

Using 1 Tesla V100-SXM2-32GB:

  • 6 minutes 1 second
  • 5 minutes 55 seconds

Using 2 Tesla V100-SXM2-32GB:

  • 6 minutes 4 seconds
  • 5 minutes 39 seconds

Using 4 Tesla V100-SXM2-32GB:

  • 6 minutes 11 seconds
  • 5 minutes 46 seconds

Relevant code segments:

Checking GPU status:

import torch

print("Is a GPU available?", torch.cuda.is_available())
print("How many GPUs are available?", torch.cuda.device_count())
print("What's the current GPU number?", torch.cuda.current_device())
print("Where's the first GPU?", torch.cuda.device(0))
print("What are the names of the GPUs?")
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('This should say cuda if the GPUs are set up properly:', device)

Output for the above (in the 4 GPU job):

Is a GPU available? True
How many GPUs are available? 4
What's the current GPU number? 0
Where's the first GPU? <torch.cuda.device object at 0x2b44157750a0>
What are the names of the GPUs?
Tesla V100-SXM2-32GB
Tesla V100-SXM2-32GB
Tesla V100-SXM2-32GB
Tesla V100-SXM2-32GB
This should say cuda if the GPUs are set up properly: cuda

Batch size:

batch_size = 64

Model training function (mostly copied from a PyTorch tutorial):

import copy
import os
import time

import torch

def train_model(model, dataloaders, criterion, optimizer, num_epochs=25, is_inception=False):
    since = time.time()
    
    val_acc_history = []
    train_acc_history = []

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train() # Set model to training mode
            else:
                model.eval() # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    # Get model outputs and calculate loss
                    # Special case for inception because in training it has an
                    # auxiliary output. In train mode we calculate the loss by
                    # summing the final output and the auxiliary output but in
                    # testing we only consider the final output.
                    if is_inception and phase == 'train':
                        # From https://discuss.pytorch.org/t/how-to-optimize-inception-model-with-auxiliary-classifiers/7958
                        outputs, aux_outputs = model(inputs)
                        loss1 = criterion(outputs, labels)
                        loss2 = criterion(aux_outputs, labels)
                        loss = loss1 + 0.4*loss2
                    else:
                        outputs = model(inputs)
                        loss = criterion(outputs, labels)

                    _, preds = torch.max(outputs, 1)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels)

            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
            if phase == 'val':
                val_acc_history.append(epoch_acc)
            if phase == 'train':
                train_acc_history.append(epoch_acc)

        print()

        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            # 'best_model_accuracy': best_acc,
            # 'best_model_weights': best_model_wts,
            'loss': loss,
            # 'train_hist': train_acc_history,
            # 'val_hist': val_acc_history,
        # }, f'/project/rrg-lelliott/jsa378/model_1_output/checkpoint_run_{run}_epoch_{epoch}.tar')
        }, str(os.getenv('location3'))+f'/checkpoint_run_{run}_epoch_{epoch}.tar')

        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'best_model_accuracy': best_acc,
            'best_model_weights': best_model_wts,
            'loss': loss,
            'train_hist': train_acc_history,
            'val_hist': val_acc_history,
        # }, f'/project/rrg-lelliott/jsa378/model_1_output/checkpoint_run_{run}.tar')
        }, str(os.getenv('location3'))+f'/checkpoint_run_{run}.tar')

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, val_acc_history, train_acc_history

Model initialization and placement on GPU:

    model_ft, input_size = initialize_model(model_name, num_classes, feature_extract, use_pretrained=True)
    print(model_ft)
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model_ft = nn.DataParallel(model_ft)
    model_ft = model_ft.to(device)

Questions and Comments:

Below I’ve listed a few questions that come to mind.

  1. Are there any apparent errors in my code (for the purpose of using multiple GPUs)?
  2. Do I need to double or quadruple my batch size to see an increase in training speed?
  3. Is it possible that the cost of inter-GPU communication roughly cancels out any speed increase?
  4. My data set is only ~14,000 images, about 85 MB total. I am using the off-the-shelf AlexNet. Is it possible that my model and data are too small to benefit from multiple GPUs?

I’m interested in using multiple GPUs because my advisor and I are toying with the idea of randomly splitting the data into training and test sets 100 times and training for 200 epochs on each split. That is 20,000 epochs, which at 2 minutes per epoch comes to about 4 weeks, hence my interest in speeding things up.
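
As a rough check on that estimate, the arithmetic can be sketched as below. The function name total_training_days is made up for illustration, and 120 seconds per epoch is the approximate single-GPU figure from the timings above.

```python
def total_training_days(n_splits, epochs_per_split, seconds_per_epoch):
    """Wall-clock days needed if every epoch runs back to back."""
    total_seconds = n_splits * epochs_per_split * seconds_per_epoch
    return total_seconds / (24 * 60 * 60)


# 100 random splits x 200 epochs each, at roughly 2 minutes per epoch
days = total_training_days(100, 200, 120)
print(f"{days:.1f} days, or {days / 7:.1f} weeks")
```

The same function can be reused with a different seconds_per_epoch value to see how much a given speedup would shorten the whole experiment.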

I appreciate any help with this.

  1. I don’t see any obvious issues in your code.

  2. Using nn.DataParallel allows you to increase the global batch size, which should reduce the epoch time, since each GPU can then process roughly the same batch size as in the single-GPU use case.

  3. That could be the case, since you are using nn.DataParallel, which introduces a lot of communication overhead. We generally recommend using DistributedDataParallel with a single process per device for the best performance.
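
To make point 2 concrete: nn.DataParallel splits each input batch along the batch dimension across the visible GPUs, so the batch size given to the DataLoader is the global one. The sketch below is plain Python arithmetic rather than PyTorch code. per_gpu_batch is a hypothetical helper, and it assumes the batch divides evenly across the devices.

```python
def per_gpu_batch(global_batch_size, n_gpus):
    """Samples each GPU sees per step when one batch is split evenly."""
    if global_batch_size % n_gpus != 0:
        raise ValueError("this sketch assumes an even split")
    return global_batch_size // n_gpus


# The original setting: batch_size = 64 on 1, 2 and 4 GPUs
for n_gpus in (1, 2, 4):
    print(n_gpus, "GPU(s), batch_size=64 ->", per_gpu_batch(64, n_gpus), "samples per GPU")

# Keeping 64 samples per GPU on 4 GPUs needs a global batch of 256
print(per_gpu_batch(256, 4))
```

With batch_size = 64 on 4 GPUs, each device only gets 16 samples per step, which is likely too little work to offset the extra communication.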

ptrblck,

Thanks for your reply. I just ran a job with a quadrupled batch size (from 64 to 256) and 4 GPUs. 3 epochs of training took 6 minutes 4 seconds, and 5 minutes 30 seconds. I guess this is a sign that I need to look into DDP, as you mentioned.

Here is an update. I have been following the documentation for my university cluster to use DDP on one GPU. (This seemed like a good place to start with DDP.) I decided to try to run my model six times in parallel on one GPU. I requested

  • 1 node
  • 6 tasks
  • 1 CPU per task
  • 2000 MB memory per CPU
  • 1 V100 32 GB GPU.

I set num_workers=6 for my data loaders and sextupled my batch size, from 64 to 384. The time taken to train for 3 epochs went down from about 6 minutes to roughly 1 minute 20 seconds.

Using nvidia-smi I saw that peak memory usage was only a bit over 5000 MiB, so I figured I’d try to go further. I modified the above by requesting 32 tasks, and I set num_workers=32 and increased my batch size to 64 x 32 = 2048. Peak memory usage is now around 24,500 MiB, but the 3 epochs now take about 2 to 2.5 minutes to train, so it’s gotten slower for some reason.

Two other seemingly interesting observations:

  • I think peak power usage was 270 W / 300 W for the 6-fold run, but lower (maybe 70 W) for the 32-fold run. (I guess it’s possible that a big spike occurred in the 32-fold run and I didn’t see it.)
  • Model accuracy seems to be way down on the 32-fold run. In the 6-fold run, my model classification accuracy (tested after each epoch) was 66%, 75%, 79%. In the 32-fold run, my model accuracy was 56%, 59%, 63%. (There is some variability in these numbers, but the gap seems consistent, based on a small amount of testing.)

For comparison, I just ran my model the old way (without any parallelism), and here are some of the results:

  • peak memory usage is 3636 MiB
  • peak power was 89 W (I think)
  • classification accuracies were 73%, 77%, 79%
  • training time for 3 epochs was 5 minutes 57 seconds.

The slowdown from the 6-fold to the 32-fold job suggests that I’m using a suboptimal configuration. However, the apparent decrease in accuracy seems much more mystifying.

Essentially I’m looking to maximize GPU utilization (and therefore hopefully speed), without any accuracy penalty.

PS: I used the seff command with Slurm to get some more resource utilization information:

6-fold job:

Nodes: 1
Cores per node: 6
CPU Utilized: 00:16:08
CPU Efficiency: 22.63% of 01:11:18 core-walltime
Job Wall-clock time: 00:11:53
Memory Utilized: 4.58 GB
Memory Efficiency: 39.10% of 11.72 GB

32-fold job:

Nodes: 1
Cores per node: 32
CPU Utilized: 00:19:40
CPU Efficiency: 4.30% of 07:37:36 core-walltime
Job Wall-clock time: 00:14:18
Memory Utilized: 4.21 GB
Memory Efficiency: 6.73% of 62.50 GB

Regular job (no parallelism):

Nodes: 1
Cores per node: 2
CPU Utilized: 00:20:37
CPU Efficiency: 41.65% of 00:49:30 core-walltime
Job Wall-clock time: 00:24:45
Memory Utilized: 2.45 GB
Memory Efficiency: 62.61% of 3.91 GB
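
For reference, the CPU efficiency figures above can be reproduced by dividing the CPU time used by the number of cores times the wall-clock time. This is only a sketch of that arithmetic; the helper names are made up and nothing here calls Slurm.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """Percentage of the allocated core time that was actually used."""
    core_walltime = cores * to_seconds(wall_clock)
    return 100 * to_seconds(cpu_utilized) / core_walltime


# (CPU Utilized, Job Wall-clock time, Cores per node) from the seff output
jobs = {
    "6-fold": ("00:16:08", "00:11:53", 6),
    "32-fold": ("00:19:40", "00:14:18", 32),
    "regular": ("00:20:37", "00:24:45", 2),
}
for name, (cpu, wall, cores) in jobs.items():
    print(f"{name}: {cpu_efficiency(cpu, wall, cores):.2f}%")
```

All three jobs use a similar amount of CPU time (16 to 21 minutes), so the low efficiency of the 32-fold job mostly reflects requested cores sitting idle.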

Increasing the global batch size may require changes to your training hyperparameters. I think this topic discusses some differences in detail, so you could check whether some of them also apply to your use case.
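
One way to see why the hyperparameters may need to change is to count optimizer updates: on a fixed dataset, a larger batch means fewer steps per epoch. The sketch below assumes about 14,000 training images (the thread gives ~14,000 as the total, so the real training split is smaller) and a hypothetical base learning rate of 0.001; neither value comes from the posted code. Scaling the learning rate linearly with the batch size is a commonly cited heuristic, not a guarantee.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of optimizer updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


def linearly_scaled_lr(base_lr, base_batch_size, batch_size):
    """Heuristic: scale the learning rate in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size


N_TRAIN = 14_000  # assumption: upper bound taken from the dataset size in the thread
BASE_LR = 0.001   # hypothetical value, not taken from the posted code

for batch_size in (64, 384, 2048):
    updates = 3 * steps_per_epoch(N_TRAIN, batch_size)
    lr = linearly_scaled_lr(BASE_LR, 64, batch_size)
    print(f"batch {batch_size}: {updates} updates in 3 epochs, scaled lr {lr:.4f}")
```

Under these assumptions, the batch-2048 run makes only about 21 updates in 3 epochs, compared with several hundred at batch 64, which may account for part of the accuracy gap after the same number of epochs.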

ptrblck,

Thanks for this link. I will have to look at it more carefully.