Inference time increases linearly with batch size on GPU

I have noticed that, at inference time with the DeepLabV3 model for image segmentation, doubling the batch size roughly doubles the inference time (and vice versa).
I was expecting that, instead, until the GPU gets near saturation, the time per batch would stay roughly constant regardless of its size.

Here’s the reproducible code snippet that I am running on an EC2 instance (g4dn.xlarge).
PyTorch version: 1.13.1
CUDA/GPU version and GPU memory are shown in the nvidia-smi output attached below.

import torch, time

model = torch.hub.load('pytorch/vision:v0.10.0', 'deeplabv3_resnet50', pretrained=True)
batch_size = 2
with torch.no_grad():
    device = 'cuda'
    dummy_input = torch.ones((batch_size, 3, 512, 512), dtype=torch.float32, device=device)
    
    print(f'batch_size = {batch_size}')
    print(f'dummy_input bytes: {dummy_input.storage().nbytes()}')

    model.to(device)
    model.eval()
    for i in range(5):
        # print(torch.cuda.memory_summary(abbreviated=True))
        torch.cuda.synchronize()
        start = time.time()
        _ = model(dummy_input)
        torch.cuda.synchronize()
        end = time.time()
        print(f'pass {i}: {end - start}')

I am aware of GPU warm-up, so I am not considering the first iteration of the loop.
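For what it's worth, the same timings can also be collected with CUDA events instead of time.time() plus synchronize; a minimal sketch (the helper name and defaults are just illustrative, and it should give numbers equivalent to the loop above):

import torch

def time_forward(model, dummy_input, n_warmup=1, n_iters=5):
    # Illustrative helper: time model(dummy_input) with CUDA events,
    # returning seconds for each timed pass (warm-up passes dropped).
    timings = []
    with torch.no_grad():
        for i in range(n_warmup + n_iters):
            start_event = torch.cuda.Event(enable_timing=True)
            end_event = torch.cuda.Event(enable_timing=True)
            start_event.record()
            _ = model(dummy_input)
            end_event.record()
            torch.cuda.synchronize()  # wait for the recorded GPU work to finish
            if i >= n_warmup:
                timings.append(start_event.elapsed_time(end_event) / 1000.0)  # ms -> s
    return timings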
These are the timings for the various batch sizes:

batch_size = 1
dummy_input bytes: 3145728
pass 0: 1.7911558151245117
pass 1: 0.09116387367248535
pass 2: 0.09141707420349121
pass 3: 0.09008550643920898
pass 4: 0.08789610862731934
-----------------------------------------
batch_size = 2
dummy_input bytes: 6291456
pass 0: 1.850174903869629
pass 1: 0.1694350242614746
pass 2: 0.16865038871765137
pass 3: 0.16701459884643555
pass 4: 0.1702117919921875
-----------------------------------------
batch_size = 4
dummy_input bytes: 12582912
pass 0: 2.0092501640319824
pass 1: 0.33000612258911133
pass 2: 0.3304569721221924
pass 3: 0.33622169494628906
pass 4: 0.3294081687927246
-----------------------------------------
batch_size = 8
dummy_input bytes: 25165824
pass 0: 2.2980120182037354
pass 1: 0.6141374111175537
pass 2: 0.6228690147399902
pass 3: 0.6222057342529297
pass 4: 0.6159627437591553
-----------------------------------------
batch_size = 16
dummy_input bytes: 50331648
pass 0: 2.885660409927368
pass 1: 1.2400434017181396
pass 2: 1.2422657012939453
pass 3: 1.2391083240509033
pass 4: 1.2441127300262451
-----------------------------------------
batch_size = 32
dummy_input bytes: 100663296
pass 0: 4.189502000808716
pass 1: 2.544891357421875
pass 2: 2.5518083572387695
pass 3: 2.5657997131347656
pass 4: 2.57613468170166

As you can see, doubling the batch size roughly doubles the inference time, which is not what I expected.
Am I missing something? Is my way of timing incorrect?

I am also adding here the output of nvidia-smi taken before the execution.

I also ran torch.cuda.memory_summary() for batch_size=32; nothing seems strange to me on the memory side.
Here is the summary after the last pass:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1602 MB |    5414 MB |  171062 MB |  169459 MB |
|---------------------------------------------------------------------------|
| Active memory         |    1602 MB |    5414 MB |  171062 MB |  169459 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   10596 MB |   10596 MB |   10596 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |    1165 MB |    1935 MB |   76126 MB |   74961 MB |
|---------------------------------------------------------------------------|
| Allocations           |     373    |     382    |    1171    |     798    |
|---------------------------------------------------------------------------|
| Active allocs         |     373    |     382    |    1171    |     798    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      31    |      31    |      31    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       7    |      10    |     317    |     310    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

Thanks for any help!

I suspect that on a T4, even batch size 1 is close to saturating compute due to the high FLOP count of the model.

Additionally, you do still see better-than-linear scaling, e.g., 32 * 0.08789610862731934 < 2.57.

However, if you can isolate the scaling behavior to a particular layer that scales linearly while the others show roughly constant execution time, then I would be interested in investigating further.
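A quick way to check that is torch.profiler; a minimal sketch (helper name and row limit are just illustrative) that prints which ops dominate CUDA time for a single forward pass:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_one_pass(model, dummy_input):
    # Illustrative helper: profile one forward pass and print the ops
    # sorted by total CUDA time.
    model.eval()
    with torch.no_grad():
        _ = model(dummy_input)          # warm-up so lazy init doesn't pollute the profile
        torch.cuda.synchronize()
        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
            with record_function('forward'):
                _ = model(dummy_input)
            torch.cuda.synchronize()
    print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))

Running it at two different batch sizes and comparing the per-op times should show whether a single layer scales linearly while the rest stay roughly flat.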

Note: a back-of-the-envelope calculation.
DeepLabV3-ResNet50 @ 512x512 is about 178.72 GFLOPs per image,
so @ batch size 1 we have

(178.72 * 10**9 / 0.08789610862731934) / 1e12 ≈ 2 TFLOP/s

which represents ~25% MFU considering that the T4 is rated at 8.1 TFLOP/s @ fp32. You may see better scaling results when switching to half-precision or automatic mixed-precision, as the T4 is rated at 65 TFLOP/s (tensor-core).
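The same estimate in code (numbers copied from above; the T4 peak figures are the spec-sheet values):

model_flops = 178.72e9              # DeepLabV3-ResNet50 @ 512x512, per image
t_batch1 = 0.08789610862731934      # fastest fp32 pass at batch size 1 from above

achieved = model_flops / t_batch1   # ~2.03e12 FLOP/s
print(f'achieved: {achieved / 1e12:.2f} TFLOP/s')
print(f'fp32 MFU: {achieved / 8.1e12:.0%}')             # vs. T4 fp32 peak
print(f'fp16 tensor-core MFU: {achieved / 65e12:.1%}')  # vs. T4 half-precision peak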

Thanks a lot for the input! 🙂

That is true (I believe you meant ‘>’); it's just not as much as I was expecting.

For some reason I never considered mixed precision for inference, even though I have used it quite a bit for training! I tried it and, as expected, the execution time for a batch went down by a lot (around a 4x speedup).

However, the pattern of ‘linear scaling’ partially remained: there is now some overall speedup from increasing the batch size, but it still seems odd to me.

I tried a different backbone, resnet101, with no change (unsurprisingly).
However, with a random UNet implementation, every batch size ran in roughly the same time.
So at this point, should I assume it is something related to the DeepLab implementation?
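For reference, a minimal sketch of the kind of batch-size sweep I mean, reporting time per image (here with the torchvision deeplabv3_resnet101 hub model; the UNet was a third-party implementation, so it's omitted):

import torch, time

def sweep_batch_sizes(model, batch_sizes=(1, 2, 4, 8, 16, 32), resolution=512):
    # Illustrative helper: report milliseconds per image at each batch size;
    # roughly constant values mean the batch dimension is being used efficiently.
    model.to('cuda').eval()
    with torch.no_grad(), torch.autocast(device_type='cuda', dtype=torch.float16):
        for bs in batch_sizes:
            x = torch.rand((bs, 3, resolution, resolution), device='cuda')
            _ = model(x)                  # warm-up for this input shape
            torch.cuda.synchronize()
            start = time.time()
            _ = model(x)
            torch.cuda.synchronize()
            per_image = (time.time() - start) / bs
            print(f'batch_size {bs:2d}: {per_image * 1000:.1f} ms per image')

model = torch.hub.load('pytorch/vision:v0.10.0', 'deeplabv3_resnet101', pretrained=True)
sweep_batch_sizes(model)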

Whenever I have time to dig into it, I will happily share any results. It's still bugging me a lot that it behaves like this.

Here’s the new code with torch.autocast for mixed precision (I also moved from torch.ones to torch.rand, but with negligible differences in the timings).

import torch, time

model = torch.hub.load('pytorch/vision:v0.10.0', 'deeplabv3_resnet50', pretrained=True)
batch_size = 8
device = 'cuda'
autocast = True
model_flops = 178.72 * 1e9   # DeepLabV3-ResNet50 FLOPs for one 512x512 image
gpu_flops = 65.13 * 1e12   # T4 peak FLOP/s at half precision (Tensor Cores)

model.to(device)
model.eval()

dummy_input = torch.rand((batch_size, 3, 512, 512), dtype=torch.float32, device=device)

with torch.no_grad():
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=autocast):
        print(f'autocast {autocast}')
        print(f'batch_size = {batch_size}')
        print(f'dummy_input bytes: {dummy_input.storage().nbytes()}')

        for i in range(5):
            # print(torch.cuda.memory_summary(abbreviated=True))

            torch.cuda.synchronize()
            start = time.time()

            _ = model(dummy_input)

            torch.cuda.synchronize()
            end = time.time()

            print(f'pass {i}: {end - start}')

flops_per_batch = (model_flops / (end - start))  # use last iteration timings
gpu_usage = flops_per_batch / gpu_flops
print(f'flops per batch: {flops_per_batch / 1e12 :.6} TFLOPS')
print(f'gpu usage: {gpu_usage * 100 :.3}%')

And here some example runs with it

autocast True
batch_size = 1
dummy_input bytes: 3145728
pass 0: 1.698645830154419
pass 1: 0.03568291664123535
pass 2: 0.035680532455444336
pass 3: 0.03172445297241211
pass 4: 0.02527141571044922
flops per batch: 7.07202 TFLOPS
gpu usage: 10.9%
-----------------------------------------
autocast True
batch_size = 2
dummy_input bytes: 6291456
pass 0: 1.7061800956726074
pass 1: 0.0683753490447998
pass 2: 0.047356367111206055
pass 3: 0.04664158821105957
pass 4: 0.04660940170288086
flops per batch: 3.83442 TFLOPS
gpu usage: 5.89%
-----------------------------------------
autocast True
batch_size = 4
dummy_input bytes: 12582912
pass 0: 1.7384240627288818
pass 1: 0.1050419807434082
pass 2: 0.0797419548034668
pass 3: 0.07993698120117188
pass 4: 0.08047652244567871
flops per batch: 2.22077 TFLOPS
gpu usage: 3.41%
-----------------------------------------
autocast True
batch_size = 8
dummy_input bytes: 25165824
pass 0: 1.8271241188049316
pass 1: 0.1599128246307373
pass 2: 0.1597745418548584
pass 3: 0.15886497497558594
pass 4: 0.15883398056030273
flops per batch: 1.1252 TFLOPS
gpu usage: 1.73%
-----------------------------------------
autocast True
batch_size = 16
dummy_input bytes: 50331648
pass 0: 1.9817469120025635
pass 1: 0.3104076385498047
pass 2: 0.31354403495788574
pass 3: 0.3140420913696289
pass 4: 0.3165256977081299
flops per batch: 0.56463 TFLOPS
gpu usage: 0.867%
-----------------------------------------
autocast True
batch_size = 32
dummy_input bytes: 100663296
pass 0: 2.3571126461029053
pass 1: 0.6947810649871826
pass 2: 0.6931068897247314
pass 3: 0.6963107585906982
pass 4: 0.6951370239257812
flops per batch: 0.2571 TFLOPS
gpu usage: 0.395%