Using two neural network modules to optimize only one

likethevegetable · March 12, 2019, 1:38am

I am new to PyTorch and am trying to build a neural network that has two sequential networks trained and evaluated in this particular sequence:

Part 1: the ‘latter’ part of the network is trained in isolation.
Part 2: the ‘former’ part of the network is trained with its output fed to the ‘latter’ part of the network (held constant), and the output of the ‘latter’ part is used for the loss.
Part 3: data is evaluated with the ‘former’ part of the network only.

My question is: I’m assuming the gradients through the ‘latter’ part have to be available in part 2, correct? I would like the ‘latter’ part of the network to operate on the output of the ‘former’ part, but NOT have the ‘latter’ part’s parameters updated in the process.

Based on the example below, will PyTorch “know” that ‘latter’ contributed to the loss, include it in the computation graph, and pass gradients back through it? Assume the latter is already trained in the example below.

Any insight is highly appreciated.


from torch.autograd import Variable  # deprecated since PyTorch 0.4; plain tensors work too

inp_var, out_var = Variable(field), Variable(lens)  # batch from torch.utils.data.DataLoader
optimizer.zero_grad()  # reset gradients to 0 for the new batch
output = former_model(inp_var)  # forward pass of the former model (to be trained)
output = latter_net(output)  # latter_net is pre-trained; the former's output is fed to it
loss = criterion(output, inp_var)  # compare the latter's output against the INPUT to the former
loss.backward()  # backward pass
optimizer.step()
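
For readers who want to run the step above, here is a minimal self-contained sketch. The two `nn.Linear` layers, the batch shape, the loss, and the SGD settings are placeholders chosen for illustration, not the models from the post. Since PyTorch 0.4 tensors and Variables are merged, so the `Variable` wrapper is omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two sub-networks; sizes are arbitrary.
former_model = nn.Linear(8, 4)
latter_net = nn.Linear(4, 8)  # assumed to be already trained

criterion = nn.MSELoss()
# Only former_model's parameters are handed to the optimizer (placeholder SGD, lr).
optimizer = torch.optim.SGD(former_model.parameters(), lr=0.1)

field = torch.randn(16, 8)  # placeholder batch standing in for the DataLoader output

optimizer.zero_grad()                     # reset the former's gradients
output = latter_net(former_model(field))  # former -> latter
loss = criterion(output, field)           # compare against the former's input
loss.backward()                           # gradients flow back through latter_net
optimizer.step()                          # updates former_model only

# Every parameter of former_model received a gradient through latter_net.
former_has_grad = all(p.grad is not None for p in former_model.parameters())
print(former_has_grad)
```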

MariosOreo · March 12, 2019, 4:00am

Hi,

I am not sure if I understand your question.
It seems that you have two networks, subnetwork1 and subnetwork2, where subnetwork2 is pre-trained, and what you want looks like this:

data -> subnetwork1 -> output1 -> subnetwork2 -> loss

The params of the pre-trained subnetwork2 should not be updated.
I think you can set requires_grad=False on the params of subnetwork2; this thread may help you.

Sunshine352 · March 12, 2019, 8:56am

Yes, PyTorch “knows” that ‘latter’ contributed to the loss; it will include it in the computation graph and automatically backpropagate gradients through it.
If you do not want the parameters of latter_net to be updated, set param.requires_grad = False before using latter_net, like below:
for param in latter_net.parameters():
    param.requires_grad = False
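
A small check of what freezing does, as a sketch with hypothetical `nn.Linear` stand-ins for the two networks: after `backward()`, the frozen parameters of `latter_net` have no `.grad`, while `former_model` still receives gradients that were propagated through `latter_net`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins; sizes are arbitrary.
former_model = nn.Linear(8, 4)
latter_net = nn.Linear(4, 8)  # assumed pre-trained

# Freeze the latter network's parameters.
for param in latter_net.parameters():
    param.requires_grad = False

x = torch.randn(16, 8)  # placeholder batch
loss = nn.functional.mse_loss(latter_net(former_model(x)), x)
loss.backward()

# No parameter gradients are stored for the frozen network...
latter_grads = [p.grad for p in latter_net.parameters()]
# ...but gradients still reach the trainable network upstream of it.
former_grads = [p.grad for p in former_model.parameters()]

print(all(g is None for g in latter_grads))
print(all(g is not None for g in former_grads))
```

In other words, requires_grad = False on a parameter only skips the gradient with respect to that parameter; the gradient with respect to the layer's input is still computed so that earlier layers can train.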

DoubtWang · March 12, 2019, 11:00am

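Set up the optimizer so that it only optimizes the parameters of former_model, by passing only those parameters when constructing it:

optimizer = torch.optim.SGD(former_model.parameters(), lr=0.01)  # SGD and lr are placeholders; use your own optimizer and settings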
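
To illustrate the optimizer-only approach, here is a sketch with hypothetical `nn.Linear` stand-ins and placeholder SGD settings. The optimizer is built from `former_model.parameters()` alone and `latter_net` is left unfrozen: its weights stay untouched after `step()`, but its `.grad` tensors are still populated by `backward()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins; sizes are arbitrary.
former_model = nn.Linear(8, 4)
latter_net = nn.Linear(4, 8)  # assumed pre-trained, NOT frozen here

# The optimizer only ever sees former_model's parameters.
optimizer = torch.optim.SGD(former_model.parameters(), lr=0.1)

# Snapshot the weights so they can be compared after the update.
latter_before = [p.detach().clone() for p in latter_net.parameters()]
former_before = [p.detach().clone() for p in former_model.parameters()]

x = torch.randn(16, 8)  # placeholder batch
optimizer.zero_grad()
loss = nn.functional.mse_loss(latter_net(former_model(x)), x)
loss.backward()
optimizer.step()

latter_unchanged = all(
    torch.equal(b, p) for b, p in zip(latter_before, latter_net.parameters())
)
former_changed = any(
    not torch.equal(b, p) for b, p in zip(former_before, former_model.parameters())
)
# Gradients for latter_net were computed and stored even though they are never used.
latter_grad_populated = all(p.grad is not None for p in latter_net.parameters())

print(latter_unchanged, former_changed, latter_grad_populated)
```

One thing to keep in mind with this setup: optimizer.zero_grad() only clears the gradients of the parameters the optimizer owns, so the unused gradients on latter_net keep accumulating across iterations unless they are cleared separately or the parameters are frozen.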

likethevegetable · March 12, 2019, 1:30pm

Yes that is exactly what I was trying to describe, thank you for clarifying and sending a helpful link.

likethevegetable · March 12, 2019, 1:34pm

Thank you. I had set up a separate optimizer for each sub-network in a similar fashion. As a check, I saved the ‘latter’ model again (as a different file) after optimizing the ‘former’ with the ‘latter’ held constant, and it appears not to have changed, as I desired.

Do you think setting requires_grad = False is not necessary then?

likethevegetable · March 12, 2019, 1:52pm

Interesting - I read into the PyTorch documentation and it mentions the same remedy for freezing parts of the network.

So to get this straight: even though requires_grad=False is set, it will still compute the gradients for that part of the network, but simply not update the weights? The name requires_grad seems a bit misleading, as the gradients are indeed calculated… ?

Sunshine352 · March 13, 2019, 3:34am

It’s necessary. If you do not set requires_grad = False and only restrict the optimizer to former_model.parameters() so that latter_net is not updated, the gradients of latter_net’s parameters are still computed, which takes up a lot of extra GPU memory.

likethevegetable · March 13, 2019, 12:48pm

Thank you for the reply. Assuming memory is not a problem, I would still achieve the same result, though. I’m more concerned about my conceptual understanding at this point, but since setting requires_grad = False is invariably recommended, I will of course do so to reduce memory consumption.