How to take a weighted combination of models and calculate the gradient wrt the combination weights?

Suppose that I have two models, M1 and M2. I want to build a joint model M = a1*M1 + a2*M2, and later calculate the gradient of M wrt a1 and a2 (i.e. wrt the combination weights).

Could anyone suggest how to do that?


If you define a1 and a2 as tensors with requires_grad=True or as nn.Parameters, Autograd will compute their gradients in the backward pass.
I’m not sure how the "gradient of M wrt aX" would be calculated in your example, but if you compute a loss using M, you would be able to compute the gradient of this loss function with respect to the parameters.
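
As a minimal sketch of the above: the two `nn.Linear` layers, the random data, and the cross-entropy loss are placeholders I made up for illustration, and I'm assuming the models are combined at the output level.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for M1 and M2; any two modules with matching
# output shapes could be used instead.
M1 = nn.Linear(4, 3)
M2 = nn.Linear(4, 3)

# Combination weights a1, a2 registered as a learnable parameter.
a = nn.Parameter(torch.tensor([0.5, 0.5]))

# Dummy batch: 8 samples, 4 features, 3 classes.
x = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))

# Joint output M(x) = a1 * M1(x) + a2 * M2(x)
out = a[0] * M1(x) + a[1] * M2(x)

# Autograd needs a scalar, so the gradient is taken of a loss computed from M.
loss = nn.functional.cross_entropy(out, target)
loss.backward()

print(a.grad)  # d(loss)/d(a1), d(loss)/d(a2)
```

After `backward()`, `a.grad` holds the gradient of the loss with respect to both combination weights, and the parameters of `M1` and `M2` get gradients as well unless you freeze them.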

I’m not familiar with your use case, but note that optimizing these parameters (aX) could just push them to negative values, which would then “minimize” the loss.

Thanks, @ptrblck, for the reply. My main concern is how to build the joint model M so that it knows about all of M1, M2, etc.
I tried M = a1*M1.parameters() + a2*M2.parameters()
But now model M does not know about M1 and M2.
My objective here is to use a convex combination with weights a1 and a2 to learn a joint model that minimizes the loss.


The line with M1.parameters() and M2.parameters() suggests that you want to do parameter averaging with replicated networks. That won’t work, as independent training runs of a network converge to distinct local minima.

OTOH, network output averaging (ensembles) can work, but you normally need staged training for that, i.e. the combination weights should be trained on top of independently pre-trained contributing models.
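
A rough sketch of such a staged setup, not a definitive implementation: `WeightedEnsemble` is a hypothetical wrapper name, the `nn.Linear` layers stand in for the pre-trained models, and running the weights through a softmax is just one assumed way to keep the combination convex.

```python
import torch
import torch.nn as nn


class WeightedEnsemble(nn.Module):
    """Hypothetical wrapper: convex combination of frozen models' outputs."""

    def __init__(self, models):
        super().__init__()
        # ModuleList registers the sub-models, so the wrapper "knows" them.
        self.models = nn.ModuleList(models)
        # Stage 2: the contributing models are assumed pre-trained and frozen.
        for p in self.models.parameters():
            p.requires_grad_(False)
        # Unconstrained logits; softmax maps them to weights >= 0 summing to 1.
        self.logits = nn.Parameter(torch.zeros(len(models)))

    def forward(self, x):
        weights = torch.softmax(self.logits, dim=0)
        outputs = torch.stack([m(x) for m in self.models], dim=0)  # (K, B, C)
        return torch.einsum("k,kbc->bc", weights, outputs)


torch.manual_seed(0)
ensemble = WeightedEnsemble([nn.Linear(4, 3), nn.Linear(4, 3)])
# Only the combination logits are optimized.
optimizer = torch.optim.SGD([ensemble.logits], lr=0.1)

# Dummy data for illustration only.
x = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))

for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(ensemble(x), target)
    loss.backward()
    optimizer.step()

weights = torch.softmax(ensemble.logits, dim=0)
print(weights)  # learned convex combination weights
```

The softmax parameterization also avoids the weights drifting to negative values during optimization.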

Thanks, @googlebot for your reply.

Actually, my idea is somewhat different. Suppose that I initialise a1 = a2 = 0.5, then disable gradients for M1 and M2, pass the data through M, and compute y = M(x). Next, I calculate the entropy (E) of y and take the gradient of E wrt a1 and a2. Based on the gradients wrt a1 and a2, I decide which model will minimise the entropy. In this way, I can select model M1 or M2 with just a single pass, instead of passing the data through every model.
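
Roughly, the entropy step could look like the sketch below. The `nn.Linear` models and the random input are placeholders, and I'm assuming M combines the outputs of M1 and M2, so this version still evaluates both models in the forward pass. Picking the weight with the most negative gradient is only a first-order heuristic that I'm assuming here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two models.
M1 = nn.Linear(4, 3)
M2 = nn.Linear(4, 3)

# Disable gradients for the model parameters; only a1, a2 are tracked.
for p in list(M1.parameters()) + list(M2.parameters()):
    p.requires_grad_(False)

# a1 = a2 = 0.5
a = torch.tensor([0.5, 0.5], requires_grad=True)

x = torch.randn(8, 4)  # dummy unlabeled batch

# y = M(x), with M taken as the weighted sum of the two outputs.
logits = a[0] * M1(x) + a[1] * M2(x)
log_probs = torch.log_softmax(logits, dim=1)
probs = log_probs.exp()

# Mean Shannon entropy E of the predicted class distribution.
entropy = -(probs * log_probs).sum(dim=1).mean()
entropy.backward()

print(a.grad)  # dE/da1, dE/da2

# Heuristic (assumption): increasing the weight with the most negative
# gradient lowers the entropy fastest, so that model is selected.
best = int(torch.argmin(a.grad))
print(f"selected model index: {best}")
```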

For two models this looks costly, but with many models it can be much more efficient, since the computation would no longer grow linearly with the number of models. The idea is something like (NeurIPS, 20)


That paper links to a PyTorch implementation. It looks pretty intrusive, but perhaps that’s necessary.