Say you have an nn.ModuleList with many layers in it. In each iteration, you select a few of the layers at random.
I noticed that the GPU memory usage reported by nvidia-smi keeps increasing whenever a previously unused layer is selected.
I would have expected the memory requirements to remain constant, since activations and gradients only need to be computed for the currently active layers.
Is there any way to prevent PyTorch from allocating more and more memory?
Code for reproduction:
import torch
import torch.nn as nn

n_layers = 1000
net = nn.ModuleList([nn.Linear(100, 100) for _ in range(n_layers)]).cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.MSELoss()
iterations = 1000

net.train()
for _ in range(iterations):
    data = torch.zeros((64, 100)).cuda()  # some dummy data
    optimizer.zero_grad()
    idx = torch.randint(0, n_layers, (2,))  # 2 layers will be selected at random
    result = 0
    for i in idx:
        result += net[i](data)  # dummy operation
    loss = criterion(result, torch.ones_like(data))  # dummy loss
    loss.backward()
    optimizer.step()
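To narrow down where the growth might come from, here is a rough diagnostic sketch rather than a confirmed explanation. It counts how many parameters Adam holds state for after each step and compares that with the number of distinct layers selected so far. The names `seen` and `history`, the smaller `n_layers = 50`, and the CPU fallback are my own assumptions for illustration and are not part of the reproduction above.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

n_layers = 50  # assumption: smaller than the original 1000 to keep this quick
net = nn.ModuleList([nn.Linear(100, 100) for _ in range(n_layers)]).to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.MSELoss()

seen = set()   # hypothetical helper: indices of layers selected at least once
history = []   # (distinct layers seen, parameters with optimizer state) per step

net.train()
for step in range(20):
    data = torch.zeros((64, 100), device=device)
    optimizer.zero_grad()
    idx = torch.randint(0, n_layers, (2,)).tolist()
    seen.update(idx)
    result = sum(net[i](data) for i in idx)
    loss = criterion(result, torch.ones_like(data))
    loss.backward()
    optimizer.step()

    # optimizer.state only has entries for parameters that have been stepped.
    n_state = len(optimizer.state)
    history.append((len(seen), n_state))

    if device == "cuda":
        # Memory held by live tensors vs. memory held by the caching allocator.
        allocated = torch.cuda.memory_allocated()
        reserved = torch.cuda.memory_reserved()
        print(step, len(seen), n_state, allocated, reserved)
    else:
        print(step, len(seen), n_state)
```

If the number of parameters with optimizer state tracks the number of layers seen so far (two parameters per nn.Linear, weight and bias), the growth would be consistent with per-parameter buffers such as Adam's moment estimates being created the first time a layer is used. In that case I would expect the usage to level off once every layer has been selected at least once, but I have not verified that this accounts for everything nvidia-smi shows.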