Say you have an nn.ModuleList containing many layers, and in each iteration you select a few of them at random.

I noticed that the reserved GPU memory, as reported by nvidia-smi, keeps increasing whenever a previously unused layer is selected.
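
To cross-check the nvidia-smi reading from inside the script, the allocator counters can be logged in each iteration. This is only a minimal sketch: the helper name `cuda_memory_mib` is my own, and it simply returns zeros on a machine without CUDA.

```python
import torch

def cuda_memory_mib():
    """Return (allocated, reserved) memory of the current CUDA device in MiB."""
    if not torch.cuda.is_available():
        return 0.0, 0.0  # no GPU available: nothing to report
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated() / mib  # memory occupied by live tensors
    reserved = torch.cuda.memory_reserved() / mib    # memory held by the caching allocator
    return allocated, reserved

allocated, reserved = cuda_memory_mib()
print(f"allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB")
```

Calling this at the end of every iteration should show whether the tensors themselves grow (allocated) or only the allocator's cache (reserved). The number in nvidia-smi also includes the CUDA context, so it will be higher than both.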

I would expect the memory requirements to remain constant, since activations and gradients only need to be computed for the currently active layers.

Is there any way to prevent PyTorch from allocating more and more memory?

Code for reproduction:

```
import torch
import torch.nn as nn

n_layers = 1000
net = nn.ModuleList([nn.Linear(100, 100) for _ in range(n_layers)]).cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.MSELoss()
iterations = 1000
net.train()
for _ in range(iterations):
    data = torch.zeros((64, 100), device="cuda")  # some dummy data
    optimizer.zero_grad()
    idx = torch.randint(0, n_layers, (2,)).tolist()  # 2 layers will be selected at random
    result = 0
    for i in idx:
        result = result + net[i](data)  # dummy operation
    loss = criterion(result, torch.ones_like(data))  # dummy loss
    loss.backward()
    optimizer.step()
```
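
One thing I want to rule out is the optimizer rather than the layers themselves. As far as I understand, Adam creates its per-parameter buffers (`exp_avg`, `exp_avg_sq`) lazily, the first time a parameter has a gradient when `step()` is called, and keeps them afterwards. The sketch below is a smaller, CPU-only variant of the loop above with fixed layer indices; the helper `train_step` is hypothetical and only there to count how many parameters currently hold optimizer state.

```python
import torch
import torch.nn as nn

net = nn.ModuleList([nn.Linear(100, 100) for _ in range(10)])
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.MSELoss()
data = torch.zeros(64, 100)  # same dummy data as in the reproduction

def train_step(layer_ids):
    """Run one optimization step using only the given layers."""
    optimizer.zero_grad()
    result = sum(net[i](data) for i in layer_ids)
    loss = criterion(result, torch.ones_like(data))
    loss.backward()
    optimizer.step()
    # Number of parameters for which Adam currently holds state
    return len(optimizer.state)

states_first = train_step([0, 1])   # layers 0 and 1 receive gradients for the first time
states_second = train_step([2, 3])  # two previously unused layers
states_repeat = train_step([0, 1])  # no new layers in this step
print(states_first, states_second, states_repeat)
```

If the count only goes up when a new layer is selected, the optimizer state (together with the `.grad` tensors of those layers) would explain at least part of the growth, and the memory should plateau once every layer has been picked at least once.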