I wonder: does GPU memory usage have a roughly linear relationship with the batch size used in training?
I was fine-tuning ResNet152. With a batch size of 8 the total GPU memory used is around 4 GB, and when the batch size is increased to 16 for training the total GPU memory used is around 6 GB. The model itself takes about 2 GB. It seems to me that the GPU memory consumption of training ResNet152 is approximately 2 GB + 2 GB * batch_size / 8?
The batch size would increase the activation sizes during the forward pass, while the model parameters (and gradients) would still use the same amount of memory, as they do not depend on the batch size. This post explains the memory usage in more detail.
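If you want to verify this on your own setup, here is a minimal sketch (assuming a CUDA device is available; resnet152 and the 224x224 input shape are just taken from your example) that measures the peak allocated memory for one forward/backward pass at a few batch sizes:

import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda"
model = models.resnet152().to(device)
criterion = nn.CrossEntropyLoss()

for batch_size in [8, 16]:
    # reset the peak stats so each batch size is measured independently
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    target = torch.randint(0, 1000, (batch_size,), device=device)

    loss = criterion(model(x), target)
    loss.backward()
    model.zero_grad()

    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"batch_size={batch_size}: peak allocated ~{peak_gb:.2f} GB")

Note that torch.cuda.max_memory_allocated() reports the memory used by tensors, while nvidia-smi also shows the memory reserved by the caching allocator and the CUDA context, so the absolute numbers will differ from what you observed.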
Hi ptrblck! I’ve had a different experience (where I get an OOM from the gradients) when I am getting many (random) outputs from a model and accumulating them inside a tensor. I would’ve imagined, based on your comment, that this would behave comparably to having larger batch sizes, where the gradients don’t take additional memory… Could you take a look at my question?
Your use case accumulates the computation graph and thus all forward activations, which do depend on the batch size.
You are also not calling backward() and are thus never creating the gradients.
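As a small illustration (a sketch, not your actual code): storing outputs without detaching keeps each forward pass's computation graph, and thus its activations, alive, while detaching releases them:

import torch
import torch.nn as nn

model = nn.Linear(128, 128)
x = torch.randn(64, 128)

# every stored output still carries a grad_fn, so the intermediate
# activations of all 10 forward passes are kept in memory
kept = torch.stack([model(x) for _ in range(10)])
print(kept.grad_fn)   # StackBackward, graph is alive

# detach() cuts the history, so only the values are stored and the
# graphs (and their activations) can be freed
freed = torch.stack([model(x).detach() for _ in range(10)])
print(freed.grad_fn)  # None

Of course, detaching is only an option if you do not need gradients through the accumulated tensor.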
In my real problem the loss depends on the final accumulated tensor after several samples.
Does that mean that there is no way to lower the memory requirements of the gradient tape?
P.S.
I was thinking maybe I could somehow accumulate the Jacobian of the accumulated tensor w.r.t. the model params to keep memory constant? But I don’t know how to translate this to the autograd package.
Not if you really need to accumulate all computation graphs.
I don’t know how the accumulated graph is used, but in case you are computing gradients with it directly, you could also call backward() on each sub-graph and accumulate the gradients instead (which have a constant memory requirement, as mentioned before).
It depends on your use case and on whether you are simply accumulating the computation graph.
Here is a simple example showing the gradients will be equal (up to floating point precision) for 3 different approaches:
1. using the entire batch at once
2. accumulating the computation graph and calling backward() on the accumulated loss
3. calling backward() on each loss and accumulating the gradients
import torch
import torch.nn as nn
import torchvision.models as models
# setup
model = models.resnet18().eval()  # eval() so batchnorm uses running stats and per-sample passes match the full batch
criterion = nn.CrossEntropyLoss()
x = torch.randn(10, 3, 224, 224)
target = torch.randint(0, 1000, (10,))
# single pass
out = model(x)
loss = criterion(out, target)
loss.backward()
grads_ref = [param.grad.clone() for param in model.parameters()]
model.zero_grad()
# accumulate computation graph
loss_total = 0.
for i in range(x.size(0)):
    x_ = x[i:i+1]
    y_ = target[i:i+1]
    out = model(x_)
    loss = criterion(out, y_)
    loss_total += loss

# scale since loss was accumulated
loss_total = loss_total / x.size(0)
loss_total.backward()
grads_acc = [param.grad.clone() for param in model.parameters()]
model.zero_grad()
# backward on each sample
for i in range(x.size(0)):
    x_ = x[i:i+1]
    y_ = target[i:i+1]
    out = model(x_)
    loss = criterion(out, y_)
    # scale the loss as the gradients will be accumulated
    loss = loss / x.size(0)
    loss.backward()

grads_acc_mult = [param.grad.clone() for param in model.parameters()]
model.zero_grad()
# check
for g_ref, g_acc, g_acc_mult in zip(grads_ref, grads_acc, grads_acc_mult):
    print((g_ref - g_acc).abs().max())
    print((g_ref - g_acc_mult).abs().max())
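To see the constant memory requirement of the per-sample backward approach directly, a rough sketch (assuming a CUDA device; model, batch size, and input shape are arbitrary) comparing the peak memory of the two accumulation strategies could look like this:

import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda"
model = models.resnet18().to(device).eval()
criterion = nn.CrossEntropyLoss()
x = torch.randn(32, 3, 224, 224, device=device)
target = torch.randint(0, 1000, (32,), device=device)

# accumulate the computation graph: all forward activations stay alive
torch.cuda.reset_peak_memory_stats(device)
loss_total = 0.
for i in range(x.size(0)):
    loss_total += criterion(model(x[i:i+1]), target[i:i+1])
(loss_total / x.size(0)).backward()
model.zero_grad()
print("accumulated graph:   {:.0f} MB".format(torch.cuda.max_memory_allocated(device) / 1024**2))

# backward per sample: each sub-graph is freed right after its backward() call
torch.cuda.reset_peak_memory_stats(device)
for i in range(x.size(0)):
    loss = criterion(model(x[i:i+1]), target[i:i+1]) / x.size(0)
    loss.backward()
model.zero_grad()
print("per-sample backward: {:.0f} MB".format(torch.cuda.max_memory_allocated(device) / 1024**2))

The first peak grows with the number of accumulated samples, while the second stays roughly constant.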