At the beginning of training, I create a neural network NN.
I create the optimizer with
optimizer = optim.Adam(NN.parameters(), lr=1e-3)
During training, I add new layers to this network (imagine dynamically increasing the number of layers of a residual network). I call
optimizer.add_param_group({"params": new_layer_params}) at each iteration where a new layer is created.
However, when I add a new layer to my NN, I first want to train only the new layer's parameters for a few steps; that is, ignore the previous layers' parameters and optimize only the newly added ones for T steps. After this warm-up, I will train the full NN (optimize the parameters of all layers).
How should I do this?
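One alternative worth considering (a minimal sketch with illustrative layer sizes, not the poster's actual code): keep a single optimizer for everything and toggle `requires_grad_` to freeze the old parameters during the warm-up phase. Parameters whose gradient is `None` are skipped by `optimizer.step()`, so the frozen layers stay untouched:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# One optimizer for the whole (growing) network.
net = nn.Sequential(nn.Linear(4, 4))
optimizer = optim.Adam(net.parameters(), lr=1e-3)

# --- a new layer gets created during training ---
new_layer = nn.Linear(4, 4)
net.add_module("layer1", new_layer)  # Sequential's forward() picks it up
optimizer.add_param_group({"params": new_layer.parameters()})

old_weight = net[0].weight.detach().clone()  # snapshot of a frozen parameter

# Phase 1: freeze everything except the new layer, then train T steps.
for p in net.parameters():
    p.requires_grad_(False)
for p in new_layer.parameters():
    p.requires_grad_(True)

for t in range(3):
    optimizer.zero_grad()
    loss = net(torch.randn(8, 4)).pow(2).mean()  # stand-in for get_loss(...)
    loss.backward()   # gradients only reach the new layer
    optimizer.step()  # frozen params have no grad, so Adam skips them

# Phase 2: unfreeze everything for full training.
for p in net.parameters():
    p.requires_grad_(True)
```

This avoids juggling multiple optimizers entirely; the same `optimizer` is used in both phases.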
My current tentative approach:
(1) Create a list of optimizers, where each optimizer is responsible for the parameters of one layer.
opt = []  # collection of optimizers
optimizer = Adam(NN.parameters())  # optimizer for my first layer's parameters
opt.append(optimizer)

for l in range(total_number_of_layers):
    # add a new layer to NN
    ...  # some code here
    # add a new optimizer to the collection; it is responsible only for the new layer's parameters
    opt.append(Adam(new_layer_parameters))
    # only train the new layer
    for t in range(T1):
        opt[-1].zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        opt[-1].step()  # only update the new parameters
    # fully train
    for t in range(T2):
        for o in opt:
            o.zero_grad()
        loss = get_loss(...)
        loss.backward(retain_graph=True)
        for o in opt:
            o.step()
Is there a nicer way of doing this? Am I using retain_graph=True correctly? My program uses a very large amount of memory, so I don't think I'm doing this efficiently.
Both cases you mentioned above are right. But don't forget to update the output of the network dynamically when you increase the number of layers. What I mean is: if you add layers to the current network with model.add_module(), this does not update the forward() method, so you have to compute the output manually, e.g. output = new_layer(model(input)). Maybe there is a better solution.
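One way around wrapping the output manually (a sketch, with an assumed residual-style architecture and illustrative dimensions): keep the layers in an nn.ModuleList and loop over it in forward(), so newly appended layers are used automatically.

```python
import torch
import torch.nn as nn

class GrowingResNet(nn.Module):
    """Sketch of a network whose depth grows during training.

    forward() iterates over self.blocks, so layers appended to the
    ModuleList are picked up without rewriting the forward pass.
    """

    def __init__(self, dim):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim)])

    def add_block(self, dim):
        new_block = nn.Linear(dim, dim)
        self.blocks.append(new_block)
        return new_block  # so the caller can build an optimizer param group

    def forward(self, x):
        for block in self.blocks:
            x = x + torch.relu(block(x))  # residual connection
        return x

model = GrowingResNet(4)
out_shallow = model(torch.randn(2, 4))
model.add_block(4)  # deeper network, same forward() code
out_deep = model(torch.randn(2, 4))
```

Because nn.ModuleList registers its entries as submodules, the new block's parameters also show up in model.parameters() automatically.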
I think you shouldn't use retain_graph=True in every backward() call; it leads to a large amount of memory usage and can even cause OOM problems. In addition, there are some problems in your code snippet.
optimizer = Adam(NN.parameters())  # the first optimizer, for the initial NN parameters

for l in range(total_number_of_layers):
    # add new layers
    ...
    optimizer.add_param_group({"params": new_module.parameters()})
    optimizer_new_layer = Adam(new_module.parameters())
    # only train the new layer
    for t in range(T1):
        optimizer_new_layer.zero_grad()
        loss = get_loss(...)
        loss.backward()  # retain_graph is not needed
        optimizer_new_layer.step()
    # fully train
    for t in range(T2):
        optimizer.zero_grad()
        loss = get_loss(...)
        loss.backward()
        optimizer.step()
Does it still have a large memory consumption after removing retain_graph=True? Let me know.
So, after removing retain_graph=True, it raises RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True.
I think it's because loss = get_loss(...) is being called twice.
It seems some variables are computed globally and reused for backpropagation. But in my opinion, the large memory cost is caused by retain_graph: with it, backward() cannot clear the graph in each iteration, so memory usage gradually increases.
Maybe the solution is to optimize your get_loss(...) function?
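That RuntimeError typically appears when a tensor computed once outside the training loop is reused across backward() calls, keeping part of an already-freed graph upstream of the loss. A minimal sketch (the model, data, and get_loss below are illustrative stand-ins, not the poster's code) of the fix, rebuilding the graph from the input on every call:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
data = torch.randn(8, 4)
target = torch.randn(8, 1)

# Problematic pattern: a tensor computed once outside the loop keeps part
# of the graph alive, so the second backward() hits freed buffers unless
# retain_graph=True is passed:
#   features = model(data)
#   def get_loss():
#       return (features - target).pow(2).mean()

# Fix: rebuild the whole graph inside get_loss on every call.
def get_loss():
    return (model(data) - target).pow(2).mean()

for t in range(5):
    optimizer.zero_grad()
    loss = get_loss()  # fresh forward pass -> fresh graph
    loss.backward()    # no retain_graph needed; buffers are freed each step
    optimizer.step()
```

If get_loss is restructured this way, each backward() owns its graph, and removing retain_graph=True should no longer raise the error.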