Hello everyone! I’m trying to find a way to train a linear layer (a vector of coefficients) that optimizes how a number of PyTorch models are combined, but I’m not sure where to start.

The idea would be something like this: I have 4 independently trained models (same network, shapes, and number of weights) and a linear layer with one weight per input model. I would like to evaluate the resulting combination on a training dataset and backpropagate the error so that only the linear layer is modified, optimizing it for a given loss function.

It’s a bit similar to the question asked in Combining Trained Models in PyTorch, but in my case I’m working with CNNs, and I would like to optimize how the models themselves are combined rather than combine the results of the models.

The posted approach is independent of the model architecture and just combines the model outputs before feeding them to a final classifier.
You could also manipulate the submodels to output the penultimate activation, if needed.

Could you explain what the first approach would do in comparison to combining the results, please?

Yes, basically I want to combine the 4 models’ weights, to eventually end up with a single model, instead of combining the results, which would be an ensemble.

In that case you could get the state_dicts from each model, average all parameters (or use another reduction instead of the mean), and reload the state_dict to a single model.
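A minimal sketch of that averaging step, in case it helps. The small CNN in `make_model` and the helper name `average_state_dicts` are made up for illustration; the sketch also assumes all models share identical keys and shapes, and it copies integer buffers (such as BatchNorm's `num_batches_tracked`) instead of averaging them.

```python
import torch
import torch.nn as nn


def average_state_dicts(models):
    """Return a state_dict whose floating-point entries are the mean over all models."""
    state_dicts = [m.state_dict() for m in models]
    averaged = {}
    for key, first in state_dicts[0].items():
        if first.is_floating_point():
            # Stack the same entry from every model and reduce with the mean.
            stacked = torch.stack([sd[key] for sd in state_dicts])
            averaged[key] = stacked.mean(dim=0)
        else:
            # Integer buffers are copied from the first model, not averaged.
            averaged[key] = first.clone()
    return averaged


def make_model():
    # Hypothetical small CNN standing in for the independently trained models.
    return nn.Sequential(
        nn.Conv2d(1, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(4 * 8 * 8, 10),
    )


torch.manual_seed(0)
models = [make_model() for _ in range(4)]

# Reload the averaged parameters into a single model of the same architecture.
merged = make_model()
merged.load_state_dict(average_state_dicts(models))
```

Swapping `mean` for another reduction only requires changing the line that reduces `stacked`.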

However, I’m very skeptical that this approach will give you good results.
Given that each model might have converged to a different local minimum, I don’t think that e.g. the average of all parameters representing local minima will give you another minimum.

Let us know how the experiment goes and whether you were able to achieve a good performance using this approach.

I’ve done that, but now what I want to try is to optimize the linear combination of the models. I’ve been able to do something like that manually (very inefficient), but I wanted to see if there’s a way to do it using PyTorch. The struggle I have is finding a way to put the vector that holds the coefficient for each model in the combination inside the backprop path, so I can use PyTorch to tune those coefficients. I’m at a loss there; I don’t know if it’s even possible.

And you are absolutely correct. I think there’s a possibility for it to work, given my previous manual tests, but that’s why I want to keep testing.

To illustrate, this would be something like the pseudo model I would like to create.

class Combinator(nn.Module):
    def __init__(self, modelA, modelB):
        super(Combinator, self).__init__()
        # Linear coefficient vector
        # [Wa, Wb]
        self.linweights = nn.Parameter(torch.ones(2))
        # Combined model
        # self.model = Wa*modelA + Wb*modelB

    def forward(self, x):
        x = self.model(x)
        return x

The problem I have is creating the resulting model that combines the others. I also wonder whether it makes sense to, for example, set the models’ parameters to “requires_grad = False” so that only the coefficients I’ve named “linweights” are trainable and changed by backprop.
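One possible way to build such a module, as a hedged sketch and not a tested recipe: keep detached copies of every model's parameters, form the weighted sum out of place in forward, and run the network with torch.func.functional_call (this assumes PyTorch 2.0 or newer). The names `template`, `frozen`, and `make_model` are made up for illustration, the coefficients start at 1/n (an assumption, so the starting point equals the plain average), and buffers such as BatchNorm statistics are not combined here.

```python
import copy

import torch
import torch.nn as nn
from torch.func import functional_call


class Combinator(nn.Module):
    def __init__(self, models):
        super().__init__()
        # One trainable coefficient per input model.
        self.linweights = nn.Parameter(torch.full((len(models),), 1.0 / len(models)))
        # Architecture template; its own parameters are frozen and replaced at every forward.
        self.template = copy.deepcopy(models[0])
        for p in self.template.parameters():
            p.requires_grad_(False)
        # Detached parameter copies kept as plain tensors (not registered,
        # so they will not follow .to(device) automatically).
        self.frozen = [
            {k: v.detach().clone() for k, v in m.named_parameters()} for m in models
        ]

    def forward(self, x):
        # Out-of-place weighted sum keeps linweights inside the autograd graph.
        combined = {
            k: sum(w * params[k] for w, params in zip(self.linweights, self.frozen))
            for k in self.frozen[0]
        }
        return functional_call(self.template, combined, (x,))


def make_model():
    # Hypothetical small CNN standing in for the trained models.
    return nn.Sequential(
        nn.Conv2d(1, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(4 * 8 * 8, 10),
    )


torch.manual_seed(0)
models = [make_model() for _ in range(4)]
combinator = Combinator(models)

# Only the coefficient vector is handed to the optimizer.
optimizer = torch.optim.SGD([combinator.linweights], lr=0.1)
x = torch.randn(8, 1, 8, 8)          # dummy batch
y = torch.randint(0, 10, (8,))       # dummy labels
loss = nn.functional.cross_entropy(combinator(x), y)
loss.backward()
optimizer.step()
```

After training, the learned coefficients can be used once more to build a single state_dict and load it into an ordinary model.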

Hello! Sorry to write again, but I haven’t been able to advance much, I was wondering if you may have an idea about what to do.

So far, the only way I’ve found to apply what I call the trainable “coefficients” is to create an nn.Parameter for each model I want to combine (say, self.linwA = nn.Parameter(abs(torch.randn(1)), requires_grad=True)) and multiply it by each of the model’s parameters, something like this:

def mulpar(self, coefficient, model):
    for k, v in model.state_dict().items():
        model.state_dict()[k][:] = coefficient.expand_as(v)*model.state_dict()[k][:]
    return model

However, I think it doesn’t work, because the resulting model is no longer trainable (requires_grad = False), and I think that’s causing a “RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn” error later when I calculate the loss: there’s nothing to backpropagate.
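A small sketch of what is probably going on here: by default state_dict() returns detached tensors, so writing into them in place changes the stored values but records nothing that links the coefficient to the output. Multiplying a frozen copy out of place and passing the result to a functional op keeps the coefficient in the graph. The single nn.Linear layer and the variable names are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)
coefficient = nn.Parameter(torch.tensor(0.5))

# state_dict() entries are detached by default: no autograd history is attached.
detached = layer.state_dict()["weight"]

# Out-of-place product of the coefficient and a frozen weight stays in the graph.
frozen_weight = layer.weight.detach()
scaled_weight = coefficient * frozen_weight

# Use the scaled weight through the functional API instead of writing it back.
out = nn.functional.linear(torch.ones(1, 3), scaled_weight, layer.bias.detach())
out.sum().backward()  # gradient now reaches the coefficient
```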

It does work, I just didn’t automate the process to generate the mirror paths. Might or might not be useful in your case, but I hope at least it gives you an idea

I had the same question, and for my use case with 1D CNNs, I wrote a custom convolution layer that holds multiple weights and uses F.conv1d for the computation. This approach might be faster than saving and loading state_dicts if performance is a concern.
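The layer itself is not shown, so the following is only a guess at what such a layer could look like, not the poster's actual code. The class name `CombinedConv1d` is made up, and the sketch assumes every source layer is an nn.Conv1d with a bias, zero padding mode, and identical hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CombinedConv1d(nn.Module):
    def __init__(self, convs):
        super().__init__()
        first = convs[0]
        self.stride, self.padding = first.stride, first.padding
        self.dilation, self.groups = first.dilation, first.groups
        # Frozen kernels stacked as (n_models, out_channels, in_channels // groups, kernel_size).
        self.register_buffer("weights", torch.stack([c.weight.detach() for c in convs]))
        # Frozen biases stacked as (n_models, out_channels).
        self.register_buffer("biases", torch.stack([c.bias.detach() for c in convs]))
        # One trainable coefficient per source layer, initialised to the plain average.
        self.coefficients = nn.Parameter(torch.full((len(convs),), 1.0 / len(convs)))

    def forward(self, x):
        # Weighted sum over the model dimension, then a single convolution.
        weight = torch.einsum("n,noik->oik", self.coefficients, self.weights)
        bias = torch.einsum("n,no->o", self.coefficients, self.biases)
        return F.conv1d(x, weight, bias, self.stride, self.padding, self.dilation, self.groups)


torch.manual_seed(0)
convs = [nn.Conv1d(2, 3, kernel_size=5, padding=2) for _ in range(4)]
layer = CombinedConv1d(convs)
x = torch.randn(6, 2, 16)   # (batch, channels, length)
out = layer(x)
```

Because the source kernels live in buffers, the coefficients are the only entries an optimizer sees when given layer.parameters().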