Hi, I am comparing the training speed of the provided PyTorch video models (such as ResNet-3D) on a single A100 GPU (80 GB) vs. two A100s (160 GB total). I use exactly the same training parameters in both cases, including the same batch size. The batch size I chose fits into the memory of one A100, so the second A100 is not needed for memory reasons.

With two A100s, training runs almost twice as fast as with one A100 when using mixed precision. I also reran the same experiment with single precision, and with RTX 3090s instead of A100s, with similar outcomes. Is this expected behavior, or is something wrong with my code? Please see the minimal executable code below, along with the measured times and speeds.

**My environment info:**

Python version: 3.10

PyTorch Build: 1.13.1

CUDA version: 11.7

**Executable code**:

```
import torch
import torchvision
import os
import time
from torch import tensor, Tensor
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
from torch.utils.data import Dataset, DataLoader
from torchvision import models
from tqdm.notebook import tqdm

BATCH_SIZE = 16
n_workers = 12


class TrainDataset(Dataset):
    def __len__(self):
        return 1760

    def __getitem__(self, idx):
        # Generate one random sample: a uint8 video clip (C, T, H, W) and a class label
        x = torch.randint(0, 255, size=(3, 64, 224, 224), dtype=torch.uint8)
        y = torch.randint(0, 400, (1,))[0]
        return x, y


train_ds = TrainDataset()
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, num_workers=n_workers, shuffle=True)
model = torchvision.models.video.r3d_18()


def train(model, train_dl, num_gpus, use_amp, warmup_iters):
    total_iters = len(train_dl)
    assert total_iters > 2 * warmup_iters

    # With more than one GPU, wrap the model so each batch is split across devices
    if num_gpus > 1:
        model = torch.nn.DataParallel(model, device_ids=list(range(num_gpus)))
    model.cuda()

    scaler = GradScaler(enabled=use_amp)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

    for i, batch in enumerate(tqdm(train_dl, leave=False)):
        # Start the timer after the first warmup_iters iterations
        if i == warmup_iters:
            start = time.time()

        xb, yb = batch[0].cuda(), batch[1].cuda()
        xb = xb / 255  # uint8 -> float in [0, 1]

        optimizer.zero_grad()
        with autocast(enabled=use_amp):
            out = model(xb)
            loss = criterion(out, yb)

        # Backward pass and optimizer step (loss scaling is a no-op when use_amp=False)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    end = time.time()
    elapsed = end - start
    speed = (total_iters - warmup_iters) / elapsed
    print(f'Time elapsed: {elapsed} seconds')
    print(f'Avg speed: {speed} iters/sec')
```

**Outputs with/without mixed-precision training for 1 vs 2 A100 GPUs:**

`train(model, train_dl, num_gpus=1, use_amp=True, warmup_iters=10)`

Time elapsed: 70.61007738113403 seconds

Avg speed: 1.4162284437138788 iters/sec

`train(model, train_dl, num_gpus=1, use_amp=False, warmup_iters=10)`

Time elapsed: 116.88389921188354 seconds

Avg speed: 0.8555498291404796 iters/sec

`train(model, train_dl, num_gpus=2, use_amp=True, warmup_iters=10)`

Time elapsed: 37.92261838912964 seconds

Avg speed: 2.6369487194656522 iters/sec

`train(model, train_dl, num_gpus=2, use_amp=False, warmup_iters=10)`

Time elapsed: 59.5269513130188 seconds

Avg speed: 1.6799113308214992 iters/sec
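
To put these numbers side by side, here is a small sketch that computes the 2-GPU vs 1-GPU speedup from the measured iters/sec above. The `speeds` dictionary and the `speedup` helper are names made up for this illustration; they are not part of the training script.

```python
# Measured throughput in iters/sec, copied from the outputs above.
# Keys are (num_gpus, use_amp); this layout is only for illustration.
speeds = {
    (1, True): 1.4162284437138788,
    (1, False): 0.8555498291404796,
    (2, True): 2.6369487194656522,
    (2, False): 1.6799113308214992,
}


def speedup(use_amp):
    """Ratio of 2-GPU throughput to 1-GPU throughput for a given precision mode."""
    return speeds[(2, use_amp)] / speeds[(1, use_amp)]


for use_amp in (True, False):
    print(f"use_amp={use_amp}: 2 GPUs vs 1 GPU = {speedup(use_amp):.2f}x")
```

This works out to roughly 1.86x with mixed precision and 1.96x with single precision, at the same total batch size of 16.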