GPU memory increases in conditional computation

Say you have an nn.ModuleList with many layers in it. In each training iteration, a few of the layers are selected at random.

I noticed that the GPU memory reported by nvidia-smi keeps increasing whenever a previously unused layer is selected.

I would instead expect the memory requirements to remain constant, since activations and gradients only need to be computed for the currently active layers.

Is there any way to prevent PyTorch from allocating more and more memory?

Code for reproduction:

import torch
import torch.nn as nn

n_layers = 1000

net = nn.ModuleList([nn.Linear(100,100) for i in range(n_layers)]).cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

criterion = nn.MSELoss()

iterations = 1000

net.train()
for _ in range(iterations):
    data = torch.zeros((64, 100)).cuda() # some dummy data

    optimizer.zero_grad()

    idx = torch.randint(0, n_layers, (2,)) # 2 layers will be selected at random

    result = 0
    for i in idx:
        result += net[i](data) # dummy operation

    loss = criterion(result, torch.ones_like(data)) # dummy loss
    loss.backward()
    optimizer.step()
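
Printing PyTorch's own counters inside the loop should show the same growth from within the script (values are in bytes):

    # optional check: bytes currently allocated / reserved by PyTorch's caching allocator
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())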

The increasing memory usage is caused by:

  • the lazily initialized state of the Adam optimizer (the running estimates are only created the first time a parameter is stepped) and
  • the gradients, which are also lazily initialized and only zeroed out (not freed) in each iteration, so the zero tensors stay allocated (a quick way to check both is sketched below).
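
One rough way to confirm both effects, using only the public optimizer.state dict and the .grad attributes (run inside the loop, e.g. after optimizer.step()):

n_state = len(optimizer.state)  # parameters that already have Adam running estimates
n_grads = sum(p.grad is not None for p in net.parameters())  # parameters whose .grad tensor is allocated
print(n_state, n_grads)  # both numbers grow until every layer has been selected at least once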

You could free the .grad attributes by calling optimizer.zero_grad(set_to_none=True) and switch to an optimizer without internal state (e.g. SGD); with both changes you should see constant memory usage.
If you want to keep using Adam, note that the memory usage “saturates” once the running estimates have been created for every layer.
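
A minimal sketch of the loop with those two changes (set_to_none needs to be passed explicitly in older PyTorch releases and is the default behavior in 2.0+):

optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)  # no per-parameter running estimates

for _ in range(iterations):
    data = torch.zeros((64, 100)).cuda()
    optimizer.zero_grad(set_to_none=True)  # frees the .grad tensors instead of filling them with zeros

    idx = torch.randint(0, n_layers, (2,))
    result = 0
    for i in idx:
        result += net[i](data)

    loss = criterion(result, torch.ones_like(data))
    loss.backward()
    optimizer.step()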