2 GPUs are almost twice as fast as 1 GPU with the same batch size!

Hi, I am comparing the training speed of the provided PyTorch video models (such as ResNet-3D) on a single A100 GPU (80 GB) vs. two A100s (160 GB total). I use exactly the same training parameters in both cases, including the same batch size. The batch size I chose fits into the memory of one A100, so the second A100 is not needed for capacity.

With two A100s, training runs almost twice as fast as with one A100 when using mixed precision. I also reran the same experiment in single precision, and with RTX 3090s instead of A100s, with similar outcomes. Is this expected behavior, or is something wrong with my code? Please see the minimal executable code below, along with the measured times and speeds.

My environment info:
Python version: 3.10
PyTorch Build: 1.13.1
CUDA version: 11.7

Executable code:

import torch
import torchvision
import os
import time
from torch import tensor, Tensor
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
from torch.utils.data import Dataset, DataLoader
from torchvision import models
from tqdm.notebook import tqdm

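# BATCH_SIZE was not defined in the original post; 16 is inferred from the
# timings below (1760 samples / 110 iterations, 100 of them timed).
BATCH_SIZE = 16
n_workers = 12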

class TrainDataset(Dataset):
    def __len__(self):
        return 1760

    def __getitem__(self, idx):
        # Generate one random uint8 clip of shape (C, T, H, W) and a random class label
        x = torch.randint(0, 256, size=(3, 64, 224, 224), dtype=torch.uint8)  # upper bound is exclusive
        y = torch.randint(0, 400, (1,))[0]

        return x, y

train_ds = TrainDataset()
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, num_workers=n_workers, shuffle=True)

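# The inputs are moved to the GPU in train(), so the model must live there too
# (nn.DataParallel also expects the parameters on device_ids[0]).
model = torchvision.models.video.r3d_18().cuda()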

def train(model, train_dl, num_gpus, use_amp, warmup_iters):
    total_iters = len(train_dl)
    assert total_iters > 2*warmup_iters
    if num_gpus > 1:
        model = torch.nn.DataParallel(model, device_ids=list(range(num_gpus)))
    scaler = GradScaler(enabled=use_amp)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
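    for i, batch in enumerate(tqdm(train_dl, leave=False)):
        # Start the timer after the first warmup_iters iterations
        if i == warmup_iters:
            torch.cuda.synchronize()
            start = time.time()
        xb, yb = batch[0].cuda(), batch[1].cuda()
        xb = xb / 255  # uint8 -> float32 in [0, 1]

        optimizer.zero_grad()
        with autocast(enabled=use_amp):
            out = model(xb)
            loss = criterion(out, yb)
        # Backward pass and optimizer step; the scaler is a pass-through when use_amp=False
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
    end = time.time()
    elapsed = end - start
    speed = (total_iters - warmup_iters) / elapsed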

    print(f'Time elapsed: {elapsed} seconds')
    print(f'Avg speed: {speed} iters/sec' )

Outputs with/without mixed-precision training for 1 vs 2 A100 GPUs:

train(model, train_dl, num_gpus=1, use_amp=True, warmup_iters=10)
Time elapsed: 70.61007738113403 seconds
Avg speed: 1.4162284437138788 iters/sec

train(model, train_dl, num_gpus=1, use_amp=False, warmup_iters=10)
Time elapsed: 116.88389921188354 seconds
Avg speed: 0.8555498291404796 iters/sec

train(model, train_dl, num_gpus=2, use_amp=True, warmup_iters=10)
Time elapsed: 37.92261838912964 seconds
Avg speed: 2.6369487194656522 iters/sec

train(model, train_dl, num_gpus=2, use_amp=False, warmup_iters=10)
Time elapsed: 59.5269513130188 seconds
Avg speed: 1.6799113308214992 iters/sec
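
As a quick sanity check on the numbers above, the speedup ratios can be computed directly from the reported throughputs. This is only a minimal sketch: the `speeds` dictionary and the `speedup` helper are illustrative names, not part of the benchmark script.

```python
# Throughput in iters/sec, copied from the four runs above.
# Keys are (num_gpus, use_amp).
speeds = {
    (1, True): 1.4162284437138788,
    (1, False): 0.8555498291404796,
    (2, True): 2.6369487194656522,
    (2, False): 1.6799113308214992,
}


def speedup(fast, slow):
    """Ratio of two measured throughputs."""
    return speeds[fast] / speeds[slow]


# Effect of the second GPU at a fixed precision setting.
for use_amp in (True, False):
    ratio = speedup((2, use_amp), (1, use_amp))
    print(f"use_amp={use_amp}: 2 GPUs vs 1 GPU = {ratio:.2f}x")

# Effect of mixed precision at a fixed GPU count.
for num_gpus in (1, 2):
    ratio = speedup((num_gpus, True), (num_gpus, False))
    print(f"num_gpus={num_gpus}: AMP vs FP32 = {ratio:.2f}x")
```

With these measurements, the second GPU gives roughly 1.86x with mixed precision and 1.96x without, and mixed precision gives roughly 1.66x on one GPU and 1.57x on two.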

I’m unsure if I understand the issue correctly, so let me know if I’m missing something.
Based on your output, you are seeing a speedup from using AMP compared to the native FP32/TF32 format on your A100, which is expected.
You are also seeing a speedup when 2 GPUs are used, which is also good.
With nn.DataParallel the batch is split across the devices, so each GPU processes batch_size//2 samples in each training step, which reduces not only the compute needed per device but also the memory copies on each device.
nn.DataParallel has some shortcomings, as the model will be copied to each device in each iteration, and we generally recommend using DistributedDataParallel instead.
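
To make the batch_size//2 point concrete, here is a minimal sketch of how a batch ends up divided across devices. It assumes the split behaves like torch.chunk along the batch dimension, which is how I understand nn.DataParallel's scatter step to work; `per_gpu_batch_sizes` is a hypothetical helper, and the batch size of 16 is inferred from the reported timings (1760 samples, 110 iterations) rather than stated in the post.

```python
import math


def per_gpu_batch_sizes(batch_size, num_gpus):
    """Approximate the per-device share of one batch, torch.chunk-style.

    Every chunk has ceil(batch_size / num_gpus) samples except possibly
    the last one, which takes whatever is left.
    """
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(per_gpu_batch_sizes(16, 1))  # [16]: one GPU processes the whole batch
print(per_gpu_batch_sizes(16, 2))  # [8, 8]: each GPU processes half of it
```

Since each device only runs the forward and backward pass on its own share, a near 2x gain with two GPUs at a fixed total batch size is plausible as long as the GPU compute dominates the step time.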

Thanks Patrick. I realized I had made a false assumption about the expected speed when doubling the batch size on a single GPU.

"nn.DataParallel has some shortcomings as the model will be copied in each iteration and we generally recommend using DistributedDataParallel instead."

I did not observe a significant increase in speed for PyTorch stock video models when using DistributedDataParallel vs nn.DataParallel. Is there a rough estimate of the expected performance gain by using DistributedDataParallel over nn.DataParallel?