I have a model architecture which is a composition of two other models with distinct parameters. So basically my model can be written as y = f(g(x)) where f and g are modules with different sets of parameters.

I have two objectives during learning. One is the main learning objective, and the other is an entropy regularization term. I want to apply these two objectives simultaneously (i.e., on the same inputs with the same state of the network parameters) but I want the main objective to apply only to g's parameters, and the entropy to apply only to f's parameters.

I can set up my code in a way where I can apply the main objective to g only and the entropy to both f and g:

# Optimizer gets parameters from both g and f.
optimizer = create_optimizer(g, f)
# Forward pass and compute the objectives.
z = g(x)
y = f(z)
main_obj = obj_fn(x, y)
h_obj = -entropy(y)
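# Compute the gradient of the main objective wrt. the intermediate tensor(s) z, then backpropagate it into the parameters of g.
z_grad = torch.autograd.grad(main_obj, z, retain_graph=True)
# Keep g's graph alive here, because the entropy backward pass below goes through it again.
torch.autograd.backward(z, z_grad, retain_graph=True)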
# Optimize the entropy wrt. the entire network.
h_obj.backward()
# Update the parameters
optimizer.step()

How can I instead apply the entropy objective only to f's parameters? I know that I can split z off into a copy that is detached from g's graph, so that when I optimize the entropy on it, the gradients only reach the parameters of f, but that would require a lot of extra bookkeeping and make the code messy. Is there an easier way to compute and apply the gradients over f's parameters only?
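For reference, the detach approach mentioned above might look roughly like the sketch below. The two `nn.Linear` modules, the random input, and this particular `entropy` function are placeholders assumed for illustration, not the models or objective from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two sub-models g and f.
g = nn.Linear(4, 3)
f = nn.Linear(3, 2)
x = torch.randn(8, 4)

def entropy(logits):
    # Assumed definition: mean Shannon entropy of the softmax over the last dim.
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

z = g(x)
# Run f a second time on a copy of z that is cut off from g's graph.
y_detached = f(z.detach())
h_obj = -entropy(y_detached)
h_obj.backward()

# Only f's parameters received a gradient from the entropy term.
f_has_grad = all(p.grad is not None for p in f.parameters())
g_has_grad = any(p.grad is not None for p in g.parameters())
```

The cost is the extra forward pass through f and having to carry both `y` and `y_detached` around, which is the bookkeeping the question is trying to avoid.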

One way is to give the optimizer only a group of the parameters. If the optimizer doesn't update a set of weights, the loss has no effect on them. That way you can keep the same loss function for all parameters and just filter in the optimizer.
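A minimal sketch of that filtering idea, assuming plain SGD, two small `nn.Linear` modules, and a squared-output loss as placeholders: the backward pass fills in gradients for both modules, but only the parameters handed to the optimizer are changed by `step()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for g and f.
g = nn.Linear(4, 3)
f = nn.Linear(3, 2)
x = torch.randn(8, 4)

# The optimizer only knows about f's parameters.
f_opt = torch.optim.SGD(f.parameters(), lr=0.1)

g_before = [p.detach().clone() for p in g.parameters()]
f_before = [p.detach().clone() for p in f.parameters()]

loss = f(g(x)).pow(2).mean()  # placeholder loss
loss.backward()               # populates .grad on both f and g
f_opt.step()                  # but only f's parameters are updated

g_unchanged = all(torch.equal(a, b) for a, b in zip(g_before, g.parameters()))
f_changed = any(not torch.equal(a, b) for a, b in zip(f_before, f.parameters()))
```

Note that g's `.grad` tensors are still populated here, so they would be picked up by any other optimizer that does cover g unless they are cleared first.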

So basically having two optimizers, one for the parameters of f and one for the parameters of g? I guess I would have to be careful about the order that I do backward and step, right? Because I’d want the gradients to be computed wrt. the parameters before any of them are updated. So I’d probably want to do backward for both objectives first, then step for the two optimizers. Does that sound right?

Looks a bit complicated to me. If you can write your objective function in terms of the x, y, and z variables, then you can use obj_fn(x, y.detach(), z). It is like giving up code reusability for bug-free gradient flow.

Yea, my code is already written in a way where that will involve a lot of extra bookkeeping.

I found something that works though (simplified from my code):

# Create two distinct optimizers.
f_opt = create_optimizer(f.parameters())
g_opt = create_optimizer(g.parameters())
# Forward pass.
z = g(x)
y = f(z)
# Update the parameters of g only, by computing the gradient wrt. z first. The gradients on f's params should still be None after this.
main_obj = obj_fn(x, y)
z_grad = torch.autograd.grad(main_obj, z, retain_graph=True)
torch.autograd.backward(z, z_grad, retain_graph=True)
g_opt.step()
# Update the parameters of f only, by doing a backward pass of the entropy objective. The gradients on g's params are not None going into this, but because f_opt is only over f's params, that doesn't matter.
h_obj = -entropy(y)
h_obj.backward()
f_opt.step()

And I am pretty sure this works (verified which params have None grads after which steps).
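One possible variant, sketched here under assumptions rather than taken from the thread, is to ask `torch.autograd.grad` for each objective's gradient with respect to exactly the parameters it should affect, and to step both optimizers only afterwards. This keeps both gradients computed at the same parameter state, and it avoids running a backward pass through g's graph after g's parameters have already been updated in place, which some PyTorch versions reject with an in-place modification error. The modules, the squared-error stand-in for `obj_fn`, the `entropy` definition, and the SGD optimizers are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for g, f, the data, and the objectives.
g = nn.Linear(4, 3)
f = nn.Linear(3, 2)
x = torch.randn(8, 4)
target = torch.randn(8, 2)

def entropy(logits):
    # Assumed definition: mean Shannon entropy of the softmax over the last dim.
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

g_opt = torch.optim.SGD(g.parameters(), lr=0.1)
f_opt = torch.optim.SGD(f.parameters(), lr=0.1)

# Single forward pass shared by both objectives.
z = g(x)
y = f(z)
main_obj = (y - target).pow(2).mean()  # placeholder for obj_fn(x, y)
h_obj = -entropy(y)

g_params = list(g.parameters())
f_params = list(f.parameters())

# Main objective -> g's parameters only; entropy -> f's parameters only.
g_grads = torch.autograd.grad(main_obj, g_params, retain_graph=True)
f_grads = torch.autograd.grad(h_obj, f_params)

# Hand the gradients to the optimizers through the usual .grad attribute.
for p, grad in zip(g_params, g_grads):
    p.grad = grad
for p, grad in zip(f_params, f_grads):
    p.grad = grad

# Step only after both sets of gradients have been computed.
g_opt.step()
f_opt.step()
```

Because nothing is accumulated into `.grad` implicitly, g never sees the entropy gradient and f never sees the main-objective gradient, regardless of the order of the two `grad` calls.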