Hello everyone,
I have one question. Do multiple similar forward passes with different inputs/outputs trigger multiple builds of the computational graph, given that the graph is exactly the same? If yes, how much does this affect memory usage?
Thanks!
Yes, different forward passes will create different computation graphs, which will store the different intermediate forward activations (which are needed for the gradient computation in the backward pass). The increase in device memory depends on the model architecture and the size of forward activations which need to be stored.
Here is a simple example using a ResNet50:
import torch
import torchvision.models as models

# no CUDA tensors allocated yet
print(torch.cuda.memory_allocated() / 1024**2)
# 0.0

device = "cuda"
model = models.resnet50().to(device)
x = torch.randn(1, 3, 224, 224, device=device)

# memory used by the model parameters and the input
print(torch.cuda.memory_allocated() / 1024**2)
# 98.30224609375

# first forward pass stores intermediate activations for its graph
out1 = model(x)
print(torch.cuda.memory_allocated() / 1024**2)
# 189.58642578125

# second forward pass builds another graph and stores another set of activations
out2 = model(x)
print(torch.cuda.memory_allocated() / 1024**2)
# 271.33935546875
Calling backward will free the intermediate forward activations, but the first backward call will also create the .grad attributes, which will use memory.
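As a rough continuation of the sketch above (assuming model, out1, and out2 from the previous snippet are still in scope), you could observe this effect directly; the exact numbers will differ on your setup:

# backward through the first graph frees its stored activations,
# but also allocates the .grad buffers for all parameters
out1.mean().backward()
print(torch.cuda.memory_allocated() / 1024**2)

# backward through the second graph frees the remaining activations;
# the .grad buffers already exist, so no additional gradient memory is allocated
out2.mean().backward()
print(torch.cuda.memory_allocated() / 1024**2)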
Agreed, now I'm certain! Thanks