Multiple forwards and comp graph building

Hello everyone,
I have one question: does running multiple similar forward passes with different inputs/outputs trigger multiple builds of the computational graph, given that the graph is exactly the same? If yes, how much does this affect memory usage?

Thanks!

Yes, different forward passes will create different computation graphs, and each will store its own intermediate forward activations (which are needed for the gradient computation in the backward pass). The increase in device memory depends on the model architecture and the size of the forward activations that need to be stored.

Here is a simple example using a ResNet50:

import torch
import torchvision.models as models 

print(torch.cuda.memory_allocated() / 1024**2)
# 0.0

device = "cuda"
model = models.resnet50().to(device)
x = torch.randn(1, 3, 224, 224, device=device)

print(torch.cuda.memory_allocated() / 1024**2)
# 98.30224609375

out1 = model(x)
print(torch.cuda.memory_allocated() / 1024**2)
# 189.58642578125

out2 = model(x)
print(torch.cuda.memory_allocated() / 1024**2)
# 271.33935546875

Calling backward will free the intermediate forward activations, but the first backward call will also create the .grad attributes for the parameters, which use additional memory.
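To illustrate, here is a minimal sketch continuing directly from the snippet above (it reuses model, x, out1, and out2, and uses a dummy .mean() reduction just to get a scalar to call backward on; the exact numbers will vary with the device and PyTorch version):

out1.mean().backward()
print(torch.cuda.memory_allocated() / 1024**2)
# The activations of the first graph are freed, but .grad buffers are now
# allocated for every parameter (roughly the size of the parameters
# themselves), so memory does not drop all the way back.

out2.mean().backward()
print(torch.cuda.memory_allocated() / 1024**2)
# The activations of the second graph are freed as well; the .grad buffers
# already exist, so no additional memory is allocated for them.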


Agreed, now I’m certain! Thanks!