Hey,
At the beginning of training, I create a neural network NN.
I create the optimizer with
optimizer = optim.Adam(NN.parameters(), lr=1e-3)
During training, I add new layers to this network (imagine dynamically increasing the number of layers of a residual network). I call
optimizer.add_param_group({"params": new_layer_params})
each time a new layer is created.
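For concreteness, the growth step looks roughly like this (a minimal sketch: make_new_layer and the layer sizes are placeholders for my real construction code):

import torch.nn as nn
from torch import optim

def make_new_layer():  # hypothetical helper standing in for my real layer construction
    return nn.Linear(64, 64)

NN = nn.ModuleList([make_new_layer()])
optimizer = optim.Adam(NN.parameters(), lr=1e-3)

# later, during training, whenever the network grows:
new_layer = make_new_layer()
NN.append(new_layer)
optimizer.add_param_group({"params": new_layer.parameters()})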
However, when I add a new layer to my NN, I want to train only the new layer's parameters for a few steps; that is, ignore the previous layers' parameters and optimize only the newly added layer for T steps. After this initial phase, I will start fully training my NN (optimizing all layers' parameters).
How should I do this?
My current tentative approach:
(1) Create a list of optimizers, where each optimizer is responsible for the parameters of one layer.
opt = []  # collection of optimizers
optimizer = optim.Adam(NN.parameters())  # optimizer for the initial layer's parameters
opt.append(optimizer)
for l in range(total_number_of_layers):
    # add a new layer to NN
    ...  # some code here
    # add a new optimizer to the collection; it is only responsible for the new layer's parameters
    opt.append(optim.Adam(new_layer_parameters))
    # only train new layers
    for t in range(T1):
        opt[-1].zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        opt[-1].step()  # only update the new parameters
    # Fully train
    for t in range(T2):
        for o in opt:
            o.zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        for o in opt:
            o.step()
However, I notice that retain_graph=True
is extremely memory inefficient; my program uses a very large amount of memory. I suspect that when I only train the new layers, loss.backward()
still computes gradients w.r.t. all parameters (including the old layers). I'm wondering whether, for this snippet, I can detach()
the old layers and backpropagate only through the new layer's parameters.
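To make concrete what I mean, here is a tiny standalone check (toy linear layers, not my real network): backward() fills in .grad for the old layer's parameters too, unless requires_grad is turned off first.

import torch
import torch.nn as nn

old = nn.Linear(4, 4)  # stands in for an already-trained layer
new = nn.Linear(4, 4)  # stands in for the newly added layer

out = new(old(torch.randn(2, 4)))
out.sum().backward()
print(old.weight.grad is None)  # False: a gradient was also computed for the old layer

old.weight.grad = None
old.weight.requires_grad_(False)
out = new(old(torch.randn(2, 4)))
out.sum().backward()
print(old.weight.grad is None)  # True: no gradient is computed for the frozen weight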
=========================================
Following the first reply,
I have changed the code to:
opt = []  # collection of optimizers
optimizer = optim.Adam(NN.parameters())  # optimizer for the initial layer's parameters
opt.append(optimizer)
for l in range(total_number_of_layers):
    # add a new layer to NN
    ...  # some code here
    # add a new optimizer to the collection; it is only responsible for the new layer's parameters
    opt.append(optim.Adam(new_layer_parameters))
    # detach (freeze) previous parameters
    for name, param in NN.named_parameters():
        if name in previous:  # some code here to check
            param.requires_grad = False
    # only train new layers
    for t in range(T1):
        opt[-1].zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        opt[-1].step()  # only update the new parameters
    # attach them back (unfreeze)
    for param in NN.parameters():
        param.requires_grad = True
    # Fully train
    for t in range(T2):
        for o in opt:
            o.zero_grad()
        loss = get_loss(...)
        loss.backward()
        for o in opt:
            o.step()
However, at the first iteration of l
we skip the # only train new layers
part (because there are no previous layers yet) and only run the # Fully train
part. Then at the second iteration, when we run the # only train new layers
part, it once again raises the error saying that I have to use retain_graph=True