Is it possible to multiply a trained model by a single scalar that is trainable and backpropagate the loss to that scalar?
I want to do this for two networks, so the idea is to train those two scalars to find the best way to combine the two models’ weights (not their outputs).
Yes. In general you can train additional parameters (e.g., things that
aren’t weights in Linears) by adding them as Parameters to your
model (which should be a torch.nn.Module) or to your Optimizer.
PyTorch will do the rest.
I’m not sure exactly what you have in mind here. But here is one idea
that might be relevant to your use case:
You have model A and model B already trained. Freeze their parameters
(.requires_grad = False). Create an Optimizer whose single trainable
parameter is your interpolation weight:
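Something like this sketch (modelA, modelB, the layer sizes, and the data are all placeholders; the point is that the two models share the same architecture and only the scalar w is trained):

```python
import torch

# placeholder stand-ins for your two pre-trained models -- what matters
# is that they share the same architecture, layer for layer
modelA = torch.nn.Sequential(
    torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1)
)
modelB = torch.nn.Sequential(
    torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1)
)

# freeze both models -- only the interpolation scalar will be trained
for p in modelA.parameters():
    p.requires_grad = False
for p in modelB.parameters():
    p.requires_grad = False

# the single trainable scalar
w = torch.nn.Parameter(torch.tensor(0.5))

# the optimizer sees only that scalar
optimizer = torch.optim.SGD([w], lr=0.01)
loss_fn = torch.nn.MSELoss()

layer1A, activation1, layer2A = modelA[0], modelA[1], modelA[2]
layer1B, layer2B = modelB[0], modelB[2]

def blended_forward(x):
    # interpolate the corresponding layers' results, then apply the
    # (shared) activation, layer by layer
    h = activation1(w * layer1A(x) + (1.0 - w) * layer1B(x))
    return w * layer2A(h) + (1.0 - w) * layer2B(h)

# dummy data, just to show the training step
x = torch.randn(8, 10)
target = torch.randn(8, 1)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(blended_forward(x), target)
    loss.backward()   # the gradient flows only to w
    optimizer.step()

print(w.item())
```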
So, the overall idea is that I want to combine two models that were fine-tuned independently into a single one by interpolating their weights, but I would like to do this using the PyTorch framework (and, most importantly, the loss functions and autograd).
Originally, I was thinking about a single “coefficient” for each model that could just multiply all of that model’s trained weights (roughly the sketch below), but I think I understand your code and it makes sense to me!
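(For reference, the plain weight-space version I originally had in mind would look something like this, with placeholder models and fixed coefficients instead of trained ones:)

```python
import copy

import torch

# placeholder models standing in for the two independently fine-tuned ones
modelA = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
modelB = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))

# one coefficient per model (fixed here, not trained)
alpha_A, alpha_B = 0.5, 0.5

# build a new state dict as a weighted combination of the two state dicts
sd_A, sd_B = modelA.state_dict(), modelB.state_dict()
merged = {k: alpha_A * sd_A[k] + alpha_B * sd_B[k] for k in sd_A}

merged_model = copy.deepcopy(modelA)
merged_model.load_state_dict(merged)
```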
If I understand correctly, it would mean adding an activation after every layer of the original models that combines the corresponding layers of the two models, is that right? What would be a good activation to test with in this case?
I was imagining that you had two models with the same architecture,
that is, that all of the layers and activations match up. The only
difference is that they have been trained (or fine tuned) differently,
so that the trained weights differ.
In this case, model A’s activation1 and model B’s activation1
are the same, so I was thinking that you would simply use activation1 after interpolating the results of layer1A and layer1B.
(Note that interpolating these two results gives the same thing as
interpolating the linear layers layer1A and layer1B themselves and then
applying the interpolated layer to the upstream input.)
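(Here is a quick numerical check of that equivalence, with made-up layer sizes:)

```python
import torch

torch.manual_seed(0)

layer1A = torch.nn.Linear(10, 16)
layer1B = torch.nn.Linear(10, 16)
w = 0.3
x = torch.randn(4, 10)

# interpolate the two layers' results ...
out_interp = w * layer1A(x) + (1.0 - w) * layer1B(x)

# ... versus interpolating the layers' weights and biases and applying
# the interpolated layer to the same input
merged = torch.nn.Linear(10, 16)
with torch.no_grad():
    merged.weight.copy_(w * layer1A.weight + (1.0 - w) * layer1B.weight)
    merged.bias.copy_(w * layer1A.bias + (1.0 - w) * layer1B.bias)

print(torch.allclose(out_interp, merged(x), atol=1e-6))  # True
```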
(As an aside, I’m not convinced your scheme will work. If model A
and model B are trained independently to do the same thing, the
values of the internal “hidden” neurons may still be very different,
and the layer weights may be very different, so averaging them might
not make sense. If one model was first trained, and then fine tuned
separately to get models A and B, the weights in A and B might line
up sensibly, but I could imagine that averaging the models together
could simply serve to average away the results of the fine tunings.)
I understand now! Thanks for explaining it to me; it makes sense.
And you are exactly correct. The models line up (the filters are correlated), and I’ve tested doing this combination manually and it works. But I wanted to use PyTorch’s automatic differentiation to optimize the search automatically, and to see how it behaves with some custom loss functions as well, so I think your suggestion will work perfectly!