I wonder: does GPU memory usage have a roughly linear relationship with the batch size used in training?
I was fine-tuning ResNet152. With a batch size of 8 the total GPU memory used is around 4 GB, and when the batch size is increased to 16 for training the total GPU memory used is around 6 GB. The model itself takes about 2 GB. It seems to me that the GPU memory consumption of training ResNet152 is approximately 2 GB + 2 GB * batch_size / 8?
The batch size would increase the activation sizes during the forward pass, while the model parameters (and gradients) would still use the same amount of memory, as they do not depend on the batch size. This post explains the memory usage in more detail.
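If you want to verify this on your own setup, here is a minimal sketch (assuming a CUDA device is available; resnet152 and the 224x224 input shape are just taken from your example) that measures the peak allocated memory for one forward/backward pass at a few batch sizes:

import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda"
model = models.resnet152().to(device)
criterion = nn.CrossEntropyLoss()

for batch_size in [8, 16]:
    # reset the peak stats so each batch size is measured independently
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    target = torch.randint(0, 1000, (batch_size,), device=device)

    loss = criterion(model(x), target)
    loss.backward()
    model.zero_grad()

    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"batch_size={batch_size}: peak allocated ~{peak_gb:.2f} GB")

Note that torch.cuda.max_memory_allocated() reports the memory used by tensors, while nvidia-smi also shows the memory reserved by the caching allocator and the CUDA context, so the absolute numbers will differ from what you observed.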
Hi ptrblck! I’ve had a different experience (where I get an OOM from the gradients) when I am getting many (random) outputs from a model and accumulating them inside a tensor. I would’ve imagined, based on your comment, that this would behave comparably to having larger batch sizes, where the gradients don’t take additional memory… Could you take a look at my question?
Your use case accumulates the computation graph and thus all forward activations, which do depend on the batch size.
You are also not calling backward() and are thus never creating the gradients.
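As a small illustration (a sketch, not your actual code): storing outputs without detaching keeps each forward pass's computation graph, and thus its activations, alive, while detaching releases them:

import torch
import torch.nn as nn

model = nn.Linear(128, 128)
x = torch.randn(64, 128)

# every stored output still carries a grad_fn, so the intermediate
# activations of all 10 forward passes are kept in memory
kept = torch.stack([model(x) for _ in range(10)])
print(kept.grad_fn)   # StackBackward, graph is alive

# detach() cuts the history, so only the values are stored and the
# graphs (and their activations) can be freed
freed = torch.stack([model(x).detach() for _ in range(10)])
print(freed.grad_fn)  # None

Of course, detaching is only an option if you do not need gradients through the accumulated tensor.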
In my real problem the loss depends on the final accumulated tensor after several samples.
Does that mean that there is no way to lower the memory requirements of the gradient tape?
P.S.
I was thinking maybe I could somehow accumulate the Jacobian of the accumulated tensor w.r.t. the model params to keep memory constant? But I don’t know how to translate this to the autograd package.
Not if you really need to accumulate all computation graphs.
I don’t know how the accumulated graph is used, but in case you are computing gradients with it directly, you could also call backward() on each sub-graph and accumulate the gradients instead (which have a constant memory requirement, as mentioned before).
It depends on your use case and on whether you are simply accumulating the computation graph.
Here is a simple example showing the gradients will be equal (up to floating point precision) for 3 different approaches:
1. using the entire batch at once
2. accumulating the computation graph and calling backward() on the accumulated loss
3. calling backward() on each loss and accumulating the gradients
import torch
import torch.nn as nn
import torchvision.models as models
# setup
model = models.resnet18().eval()  # eval() so batchnorm uses running stats and per-sample passes match the full batch
criterion = nn.CrossEntropyLoss()
x = torch.randn(10, 3, 224, 224)
target = torch.randint(0, 1000, (10,))
# single pass
out = model(x)
loss = criterion(out, target)
loss.backward()
grads_ref = [param.grad.clone() for param in model.parameters()]
model.zero_grad()
# accumulate computation graph
loss_total = 0.
for i in range(x.size(0)):
    x_ = x[i:i+1]
    y_ = target[i:i+1]
    out = model(x_)
    loss = criterion(out, y_)
    loss_total += loss

# scale since loss was accumulated
loss_total = loss_total / x.size(0)
loss_total.backward()
grads_acc = [param.grad.clone() for param in model.parameters()]
model.zero_grad()
# backward on each sample
for i in range(x.size(0)):
    x_ = x[i:i+1]
    y_ = target[i:i+1]
    out = model(x_)
    loss = criterion(out, y_)
    # scale the loss as the gradients will be accumulated
    loss = loss / x.size(0)
    loss.backward()

grads_acc_mult = [param.grad.clone() for param in model.parameters()]
model.zero_grad()
# check
for g_ref, g_acc, g_acc_mult in zip(grads_ref, grads_acc, grads_acc_mult):
    print((g_ref - g_acc).abs().max())
    print((g_ref - g_acc_mult).abs().max())
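To see the constant memory requirement of the per-sample backward approach directly, a rough sketch (assuming a CUDA device; model, batch size, and input shape are arbitrary) comparing the peak memory of the two accumulation strategies could look like this:

import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda"
model = models.resnet18().to(device).eval()
criterion = nn.CrossEntropyLoss()
x = torch.randn(32, 3, 224, 224, device=device)
target = torch.randint(0, 1000, (32,), device=device)

# accumulate the computation graph: all forward activations stay alive
torch.cuda.reset_peak_memory_stats(device)
loss_total = 0.
for i in range(x.size(0)):
    loss_total += criterion(model(x[i:i+1]), target[i:i+1])
(loss_total / x.size(0)).backward()
model.zero_grad()
print("accumulated graph:   {:.0f} MB".format(torch.cuda.max_memory_allocated(device) / 1024**2))

# backward per sample: each sub-graph is freed right after its backward() call
torch.cuda.reset_peak_memory_stats(device)
for i in range(x.size(0)):
    loss = criterion(model(x[i:i+1]), target[i:i+1]) / x.size(0)
    loss.backward()
model.zero_grad()
print("per-sample backward: {:.0f} MB".format(torch.cuda.max_memory_allocated(device) / 1024**2))

The first peak grows with the number of accumulated samples, while the second stays roughly constant.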