Why are 2 GPUs twice as fast as 1 GPU even though the batch size is the same?

I am doing video classification using optical flow. My training pipeline is a simple two-stage network:

[input video] → RAFT model (inference only) → [optical flow] → Inception 3D model → [class label]

I have access to two A100 80GB GPUs, but a batch size of 32 fits in the memory of a single GPU, so GPU memory gives me no reason to use both. However, I notice that keeping the total batch size at 32 while using two GPUs (so each GPU processes 16 samples rather than 32), with both the RAFT and the Inception 3D models wrapped in DataParallel, doubles the speed. The bottleneck in my pipeline seems to be the RAFT model, which takes most of the time.

To summarize, here are the two cases:

Case 1: one A100 GPU, batch size = 32, time per epoch = 4 hours

Case 2: two A100 GPUs, batch size = 32 (16 per GPU), time per epoch = 2 hours

Any idea why case 1 is twice as slow as case 2 even though the batch size in both cases is the same? What am I missing?

This shouldn’t be the case. Could you post the model definitions and the input shapes, together with a minimal, executable code snippet that shows the difference in iteration time?
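One way such a timing snippet could look is sketched below. It is only an illustration: `mean_seconds_per_iteration` and `dummy_step` are hypothetical names, not part of the original code. CUDA kernels are launched asynchronously, so for GPU code the `sync` argument would typically be `torch.cuda.synchronize`; otherwise the clock can stop before the queued work has finished.

```python
import time


def mean_seconds_per_iteration(step, n_iters=20, n_warmup=3, sync=None):
    """Return the average wall-clock seconds per call of `step`.

    sync: optional zero-argument callable that blocks until all queued
    work is done (for CUDA code, torch.cuda.synchronize).
    """
    for _ in range(n_warmup):   # warm-up calls are not timed
        step()
    if sync is not None:
        sync()                  # finish warm-up work before starting the clock
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    if sync is not None:
        sync()                  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / n_iters


# Stand-in workload; in a real test this would be one training iteration.
def dummy_step():
    sum(i * i for i in range(10_000))


sec_per_iter = mean_seconds_per_iteration(dummy_step, n_iters=5, n_warmup=1)
print(f"{sec_per_iter:.6f} s/iteration")
```

Running the same helper once with the single-GPU model and once with the DataParallel-wrapped model, on identical inputs, gives directly comparable numbers.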

Hi ptrblck,

This is a snippet of the training part:

train_dl = torch.utils.data.DataLoader(train_ds, batch_size=8, num_workers=12, shuffle=True)

for epoch in range(nepochs):

    raft.eval()
    i3d.train()

    for i, batch in enumerate(tqdm(train_dl, leave=False)):

        xb, yb = tfm_x(batch[0]), batch[1]
        xb = xb.to(device)     #xb.shape = (2, 512, 3, 256, 256), xb.dtype = torch.uint8
        yb = yb.to(device)     #yb.shape = (8,),                  yb.dtype = torch.int64

        optimizer.zero_grad()

        with torch.no_grad():
            _, flow = raft.forward(xb[0], xb[1], iters=12, test_mode=True)
            flow = flow.view(-1, 64, 2, 256, 256).permute(0,2,1,3,4)   #new flow.shape = (8, 2, 64, 256, 256)

        with autocast(enabled=use_amp):
            outb = i3d.forward(flow)                                           #outb.shape = (8, 100)
            loss = criterion(outb, yb)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        scheduler.step()

raft is defined here: RAFT/raft.py at master · princeton-vl/RAFT · GitHub
i3d is defined here: kinetics_i3d_pytorch/i3dpt.py at master · hassony2/kinetics_i3d_pytorch · GitHub, and is constructed with the argument modality='flow'.
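The shape comments in the snippet imply that each iteration feeds RAFT far more samples than the DataLoader batch size suggests. The sketch below only does that bookkeeping in plain Python. It assumes DataParallel splits tensor inputs along dimension 0 in `torch.chunk`-style pieces; `chunk_sizes` is a hypothetical helper written for this illustration.

```python
import math


def chunk_sizes(n, n_chunks):
    """Sizes of the pieces when n items are split torch.chunk-style:
    pieces of ceil(n / n_chunks) items, with a smaller or missing last piece."""
    size = math.ceil(n / n_chunks)
    sizes = []
    remaining = n
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


clips_per_batch = 8         # batch_size in the DataLoader above
flow_frames_per_clip = 64   # from flow.view(-1, 64, 2, 256, 256)

# Number of frame pairs RAFT sees per iteration; matches xb.shape[1] == 512.
frame_pairs = clips_per_batch * flow_frames_per_clip

raft_per_gpu = chunk_sizes(frame_pairs, 2)      # frame pairs per GPU for RAFT
i3d_per_gpu = chunk_sizes(clips_per_batch, 2)   # clips per GPU for I3D

print(frame_pairs, raft_per_gpu, i3d_per_gpu)
```

Under these assumptions, RAFT runs on 512 frame pairs per iteration on one GPU, or 256 per GPU on two, while I3D sees 8 clips, or 4 per GPU.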

I ran the code above twice. The first run was on a single GPU, with the nn.DataParallel(model, device_ids=[0,1]) wrapping commented out. The second run was on 2 GPUs, with the DataParallel wrapping enabled.

The average time per iteration of the inner loop above, measured on RTX 3090 GPUs, was:
1 GPU: 5.02 seconds/iteration
2 GPUs: 2.60 seconds/iteration
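For reference, the speedup and parallel efficiency implied by these two measurements can be worked out as follows. This is plain arithmetic on the reported numbers; the variable names are chosen just for this illustration.

```python
t_one_gpu = 5.02   # seconds/iteration on 1 GPU (reported above)
t_two_gpu = 2.60   # seconds/iteration on 2 GPUs (reported above)
n_gpus = 2

speedup = t_one_gpu / t_two_gpu    # about 1.93
efficiency = speedup / n_gpus      # about 0.97; 1.0 would be perfect scaling

print(f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.1%}")
```

A value this close to 2x is consistent with the 4 hour versus 2 hour epoch times reported earlier.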

I also tried two values for num_workers in train_dl, 6 and 12. The measured seconds per iteration was the same in both cases.
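One way to check whether data loading matters at all is to separate the time spent waiting for the next batch from the time spent processing it. The sketch below is a minimal illustration with hypothetical names (`split_wait_and_compute`, `slow_step`) and a sleep standing in for the model. With real GPU code, the step would need to end with a synchronization call such as `torch.cuda.synchronize()` so the compute time is not undercounted.

```python
import time


def split_wait_and_compute(batches, step):
    """Return (seconds spent waiting for batches, seconds spent in step)."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)    # time blocked on the data source
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)             # time spent processing the batch
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_step(batch):
    time.sleep(0.01)            # stand-in for forward/backward on one batch


wait_s, compute_s = split_wait_and_compute(range(5), slow_step)
print(f"waiting for data: {wait_s:.4f} s, compute: {compute_s:.4f} s")
```

If the waiting time is negligible next to the compute time, changing num_workers would not be expected to move the seconds per iteration.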

Any thoughts?