At the beginning of training, I create a neural network NN.
I create the optimizer with
optimizer = optim.Adam(NN.parameters(), lr=1e-3)
During training, I add new layers to this network (imagine dynamically increasing the number of layers of a residual network). I call
optimizer.add_param_group({"params": new_layer_params}) at each iteration where a new layer is created.
However, when I add a new layer to my NN, I first want to train only the new layer's parameters for a few steps; that is, ignore the previous layers' parameters and optimize only the newly added ones for T steps. After this warm-up, I will train the full NN (optimize the parameters of all layers).
How should I do this?
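One alternative worth considering (a minimal sketch with illustrative layer sizes, not the poster's actual code): keep a single optimizer for everything and toggle `requires_grad_` to freeze the old parameters during the warm-up phase. Parameters whose gradient is `None` are skipped by `optimizer.step()`, so the frozen layers stay untouched:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# One optimizer for the whole (growing) network.
net = nn.Sequential(nn.Linear(4, 4))
optimizer = optim.Adam(net.parameters(), lr=1e-3)

# --- a new layer gets created during training ---
new_layer = nn.Linear(4, 4)
net.add_module("layer1", new_layer)  # Sequential's forward() picks it up
optimizer.add_param_group({"params": new_layer.parameters()})

old_weight = net[0].weight.detach().clone()  # snapshot of a frozen parameter

# Phase 1: freeze everything except the new layer, then train T steps.
for p in net.parameters():
    p.requires_grad_(False)
for p in new_layer.parameters():
    p.requires_grad_(True)

for t in range(3):
    optimizer.zero_grad()
    loss = net(torch.randn(8, 4)).pow(2).mean()  # stand-in for get_loss(...)
    loss.backward()   # gradients only reach the new layer
    optimizer.step()  # frozen params have no grad, so Adam skips them

# Phase 2: unfreeze everything for full training.
for p in net.parameters():
    p.requires_grad_(True)
```

This avoids juggling multiple optimizers entirely; the same `optimizer` is used in both phases.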
My current tentative approach:
(1) Create a list of optimizers, where each optimizer is responsible for the parameters of one layer.
opt = []  # collection of optimizers
optimizer = Adam(NN.parameters())  # optimizer for my first layer's parameters
opt.append(optimizer)

for l in range(total_number_of_layers):
    # add a new layer to NN
    ...  # some code here
    # add a new optimizer to the collection; it is responsible only for the new layer's parameters
    opt.append(Adam(new_layer_parameters))
    # only train the new layer
    for t in range(T1):
        opt[-1].zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        opt[-1].step()  # only update the new parameters
    # fully train
    for t in range(T2):
        for o in opt:
            o.zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        for o in opt:
            o.step()
Is there a nicer way of doing this? Am I using retain_graph=True correctly? My program uses a very large amount of memory, so I don't think I'm doing this efficiently.
Both cases you mentioned above are right. But don't forget to update the output of the network dynamically when you increase the number of layers. What I mean is: if you add layers to the current network with model.add_module(), this does not update the forward() method, so you have to compute the output manually, e.g. output = new_layer(model(input)). Maybe there is a better solution.
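One way around wrapping the output manually (a sketch, with an assumed residual-style architecture and illustrative dimensions): keep the layers in an nn.ModuleList and loop over it in forward(), so newly appended layers are used automatically.

```python
import torch
import torch.nn as nn

class GrowingResNet(nn.Module):
    """Sketch of a network whose depth grows during training.

    forward() iterates over self.blocks, so layers appended to the
    ModuleList are picked up without rewriting the forward pass.
    """

    def __init__(self, dim):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim)])

    def add_block(self, dim):
        new_block = nn.Linear(dim, dim)
        self.blocks.append(new_block)
        return new_block  # so the caller can build an optimizer param group

    def forward(self, x):
        for block in self.blocks:
            x = x + torch.relu(block(x))  # residual connection
        return x

model = GrowingResNet(4)
out_shallow = model(torch.randn(2, 4))
model.add_block(4)  # deeper network, same forward() code
out_deep = model(torch.randn(2, 4))
```

Because nn.ModuleList registers its entries as submodules, the new block's parameters also show up in model.parameters() automatically.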
I think you shouldn't use retain_graph=True in every backward() call; it leads to a large amount of memory usage and can even cause OOM problems. In addition, there are some problems in your code snippet.
optimizer = Adam(NN.parameters())  # the first optimizer, for the initial NN parameters

for l in range(total_number_of_layers):
    # add new layers
    ...
    optimizer.add_param_group({"params": new_module.parameters()})
    optimizer_new_layer = Adam(new_module.parameters())
    # only train the new layer
    for t in range(T1):
        optimizer_new_layer.zero_grad()
        loss = get_loss(...)
        loss.backward()  # retain_graph is not needed
        optimizer_new_layer.step()
    # fully train
    for t in range(T2):
        optimizer.zero_grad()
        loss = get_loss(...)
        loss.backward()
        optimizer.step()
Does it still have a large memory consumption after removing retain_graph=True? Let me know.
So, after removing retain_graph=True, it raises RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True.
I think it's because loss = get_loss(...) is being called twice.
It seems some variables are computed globally and reused for backpropagation. But in my opinion, the large memory cost is caused by retain_graph: with it, backward() cannot clear the graph in each iteration, so memory usage gradually increases.
Maybe the solution is to optimize your get_loss(...) function?
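That RuntimeError typically appears when a tensor computed once outside the training loop is reused across backward() calls, keeping part of an already-freed graph upstream of the loss. A minimal sketch (the model, data, and get_loss below are illustrative stand-ins, not the poster's code) of the fix, rebuilding the graph from the input on every call:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
data = torch.randn(8, 4)
target = torch.randn(8, 1)

# Problematic pattern: a tensor computed once outside the loop keeps part
# of the graph alive, so the second backward() hits freed buffers unless
# retain_graph=True is passed:
#   features = model(data)
#   def get_loss():
#       return (features - target).pow(2).mean()

# Fix: rebuild the whole graph inside get_loss on every call.
def get_loss():
    return (model(data) - target).pow(2).mean()

for t in range(5):
    optimizer.zero_grad()
    loss = get_loss()  # fresh forward pass -> fresh graph
    loss.backward()    # no retain_graph needed; buffers are freed each step
    optimizer.step()
```

If get_loss is restructured this way, each backward() owns its graph, and removing retain_graph=True should no longer raise the error.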