Hello, I noticed that inference time scales poorly with larger input sizes when using a resnet50 network:
input size: 224
batch size: 1: 6.78 (ms)
batch size: 4: 11.99 (ms)
batch size: 16: 40.49 (ms)
input size: 640
batch size: 1: 20.93 (ms)
batch size: 4: 84.83 (ms)
batch size: 16: 331.74 (ms)
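To quantify this, here are the ratios of the times above compared to the naive expectation that the work grows with batch size and with pixel count per image (this is just arithmetic on the numbers above):

# Ratios of the measured times vs. the naive expectation that work
# scales with batch size and with pixel count per image.
print((640 / 224) ** 2)        # ~8.16x more pixels per image
print(331.74 / 40.49)          # ~8.19x  (640 vs 224, batch 16)
print(20.93 / 6.78)            # ~3.09x  (640 vs 224, batch 1)
print(40.49 / 6.78)            # ~5.97x  (batch 16 vs 1, input 224)
print(331.74 / 20.93)          # ~15.85x (batch 16 vs 1, input 640)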
Reproduction:
Code:
import time
from statistics import mean

import torch
import torchvision.models as models

if __name__ == "__main__":
    if not torch.cuda.is_available():
        print("cuda not available")
        exit()

    device = torch.device("cuda")
    model = models.resnet50()
    model = model.to(device)
    model.eval()

    input_sizes = [224, 640]
    batch_sizes = [1, 4, 16]

    for input_size in input_sizes:
        print(f"\ninput size: {input_size}")
        for batch_size in batch_sizes:
            # Warm-up
            inputs = torch.randn(
                batch_size, 3, input_size, input_size
            ).to(device)
            with torch.no_grad():
                _ = model(inputs)

            measures = []
            for _ in range(100):
                inputs = torch.randn(
                    batch_size, 3, input_size, input_size
                ).to(device)
                start_time = time.time()
                with torch.no_grad():
                    _ = model(inputs)
                # Wait for the GPU to finish before stopping the timer
                torch.cuda.synchronize()
                elapsed_time = time.time() - start_time
                measures.append(elapsed_time)

            mean_measure = mean(measures)
            print(
                f"  batch size: {batch_size}: "
                f"{mean_measure * 1e3:.2f} (ms)"
            )
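In case the wall-clock timing itself is a factor, here is a sketch of the same inner loop using torch.cuda.Event and a longer warm-up (illustrative only; the numbers above come from the code as posted, and time_forward is just a name I made up here):

def time_forward(model, inputs, iters=100, warmup=10):
    # Extra warm-up iterations so lazy initialization and cuDNN
    # algorithm selection do not leak into the measurements.
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(inputs)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times_ms = []
    with torch.no_grad():
        for _ in range(iters):
            start.record()
            _ = model(inputs)
            end.record()
            torch.cuda.synchronize()
            times_ms.append(start.elapsed_time(end))  # milliseconds
    return sum(times_ms) / len(times_ms)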
Environment: Google Colab with a T4 GPU. I observed the same behavior on an L4 GPU with the pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel Docker image.
I also monitored GPU utilization and memory usage on the L4 and noticed it was using about 70% of the GPU and ~10% of the memory with the biggest batch size and the biggest input size.
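For reference, this can be checked with nvidia-smi or, from Python, roughly like this (assuming pynvml is installed so that torch.cuda.utilization works):

# Rough check of GPU utilization and memory usage from Python.
free, total = torch.cuda.mem_get_info(device)
print(f"GPU utilization: {torch.cuda.utilization(device)}%")
print(f"memory used: {100 * (total - free) / total:.1f}%")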
I also tried different networks, even a single Conv2d layer, and saw the same behavior; a minimal variant looks roughly like the sketch below.
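# Hypothetical minimal variant: swap the ResNet-50 for a single
# convolution (channel counts here are arbitrary, for illustration only);
# the rest of the benchmark loop stays the same.
model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
model = model.to(device)
model.eval()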
Is that behaviour expected?