Multiply a model by a trainable scalar

Hello,

Is it possible to multiply a trained model by a single scalar that is trainable and backpropagate the loss to that scalar?

I want to do this for two networks, so the idea is to train those two scalars to find how to ideally combine the two models’ weights (not the outputs).

Thanks in advance for any guidance.

Hello Victor!

Yes. In general you can train additional parameters (e.g., things that
aren’t weights in Linears) by adding them as Parameters to your
model (which should be a torch.nn.Module) or by passing them directly
to your Optimizer. PyTorch will do the rest.
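
To make the "add it as a Parameter" route concrete, here is a minimal sketch. The wrapper class ScaledModel and the toy nn.Linear base model are hypothetical names chosen for illustration, and the scalar here multiplies the model's output only to show how registration works, not to implement the weight-combining scheme itself.

```python
import torch
import torch.nn as nn

class ScaledModel(nn.Module):
    """Hypothetical wrapper: a frozen model whose output is scaled by one trainable scalar."""

    def __init__(self, base: nn.Module, init: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Registering the scalar as a Parameter makes it appear in .parameters()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return self.scale * self.base(x)

torch.manual_seed(0)
model = ScaledModel(nn.Linear(4, 2))  # toy stand-in for a trained model

# Only the scalar should be trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()  # populates model.scale.grad; the frozen weights get no gradient
```

With this setup, an optimizer built from `[p for p in model.parameters() if p.requires_grad]` would update only the scalar.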

I’m not sure exactly what you have in mind here. But here is one idea
that might be relevant to your use case:

You have model A and model B already trained. Freeze their layers
(set .requires_grad = False on each of their parameters). Create an
Optimizer that contains your weight parameter as its single trainable
parameter:

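import torch
weight_param = torch.tensor([1.234], requires_grad=True)  # must require grad to be trained
optim = torch.optim.SGD([weight_param], lr=0.01)  # lr is a placeholder value; tune it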

(where 1.234 is your initial value of weight_param).

Then do something like:

# first layer: blend the outputs of the two frozen models
xA = layer1A(input)
xB = layer1B(input)
x = weight_param * xA + (1.0 - weight_param) * xB
x = activation1(x)
# second layer
xA = layer2A(x)
xB = layer2B(x)
x = weight_param * xA + (1.0 - weight_param) * xB
x = activation2(x)
...
loss = criterion(x, target)
optim.zero_grad()
loss.backward()  # with A and B frozen, only weight_param receives a gradient
optim.step()

(This code can be rearranged in various ways, but this is the basic
idea.)
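
For reference, here is one way the pieces above might fit together end to end. This is only a sketch under assumptions: the two-layer make_model, the toy data, the learning rate, and the helper blended_forward are all invented for illustration, and the target is simply model A's own output so that weight_param = 1 is a known optimum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model():
    # Stand-in for a trained network; both models must share this architecture.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

model_a, model_b = make_model(), make_model()
for p in list(model_a.parameters()) + list(model_b.parameters()):
    p.requires_grad_(False)  # freeze both models

weight_param = torch.tensor([0.5], requires_grad=True)
optim = torch.optim.SGD([weight_param], lr=0.1)  # lr chosen arbitrarily
criterion = nn.MSELoss()

def blended_forward(x):
    # Walk both models in lockstep, blending the output of each linear layer.
    for layer_a, layer_b in zip(model_a, model_b):
        if isinstance(layer_a, nn.Linear):
            x = weight_param * layer_a(x) + (1.0 - weight_param) * layer_b(x)
        else:
            x = layer_a(x)  # parameter-free activation, identical in both models
    return x

inputs = torch.randn(32, 4)
target = model_a(inputs)  # toy target that model A fits exactly

losses = []
for _ in range(200):
    loss = criterion(blended_forward(inputs), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    losses.append(loss.item())
```

In a real use case the target would come from your dataset, and the loop would iterate over batches rather than one fixed tensor.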

Is this the line along which you were thinking?

Best.

K. Frank

Hello Frank! Thank you very much for your reply!

So, the overall idea is that I want to combine two models that were fine-tuned independently into a single one by interpolating their weights, but I would like to do this using the PyTorch framework (and, most importantly, the loss functions and autograd).

Originally, I was thinking about a single “coefficient” for each model that could just multiply all of the model’s trained weights, but I think I understand your code and it makes sense to me!

If I understand correctly, it would be adding an activation after every layer of the original models that combines the corresponding layer of the two original models, is that right? What could be a good activation to test with in this case?

This sounds great!

Hi Victor!

I was imagining that you had two models with the same architecture,
that is, that all of the layers and activations match up. The only
difference is that they have been trained (or fine tuned) differently,
so that the trained weights differ.

In this case, model A’s activation1 and model B’s activation1
are the same, so I was thinking that you would simply use
activation1 after interpolating the results of layer1A and layer1B.
(Note, the result of interpolating these two results is the same as
what you would get by interpolating the linear layers layer1A and
layer1B, and then applying the interpolated layer to the upstream
input.)
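
The equivalence in the note above can be checked numerically. A small sketch, with arbitrary layer sizes and an arbitrary coefficient w; it relies on the layer being linear in its weights and bias, so it would not carry over to a layer that is not.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_a, layer_b = nn.Linear(4, 3), nn.Linear(4, 3)
w = 0.3  # interpolation coefficient (arbitrary value for the check)
x = torch.randn(5, 4)

# Option 1: interpolate the outputs of the two layers.
out_mix = w * layer_a(x) + (1.0 - w) * layer_b(x)

# Option 2: interpolate the weights and biases, then apply the single blended layer.
mixed = nn.Linear(4, 3)
with torch.no_grad():
    mixed.weight.copy_(w * layer_a.weight + (1.0 - w) * layer_b.weight)
    mixed.bias.copy_(w * layer_a.bias + (1.0 - w) * layer_b.bias)
weight_mix = mixed(x)

# The two results should agree up to floating-point rounding.
same = torch.allclose(out_mix, weight_mix, atol=1e-6)
```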

(As an aside, I’m not convinced your scheme will work. If model A
and model B are trained independently to do the same thing, the
values of the internal “hidden” neurons may still be very different,
and the layer weights may be very different, so averaging them might
not make sense. If one model was first trained, and then fine tuned
separately to get models A and B, the weights in A and B might line
up sensibly, but I could imagine that averaging the models together
could simply serve to average away the results of the fine tunings.)

Good luck.

K. Frank

Hello Frank!

Yes, that’s exactly the case!

I understand now! Thanks for explaining it to me, it makes sense now. I’ll try that today to see how it goes, but it really sounds like what I wanted to test.

And you are exactly correct. The models line up (the filters are correlated), and I’ve tested doing this combination manually and it works. But I wanted to use PyTorch’s automatic differentiation to optimize the search automatically and see how it behaves with some custom loss functions as well, so I think your suggestion will work perfectly!
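
For the original "one coefficient per model" formulation, a possible sketch is to build the combined weights on the fly and run them through torch.func.functional_call. This assumes PyTorch 2.0 or later; the toy make_model, the data, the learning rate, and the names coef_a and coef_b are made up for illustration and are not taken from the actual implementation.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)

def make_model():
    # Stand-in for a fine-tuned network; both models share this architecture.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

model_a, model_b = make_model(), make_model()
# Detached copies of the trained weights: these act as constants.
params_a = {k: v.detach() for k, v in model_a.named_parameters()}
params_b = {k: v.detach() for k, v in model_b.named_parameters()}

# One trainable coefficient per model.
coef_a = nn.Parameter(torch.tensor(0.5))
coef_b = nn.Parameter(torch.tensor(0.5))
optim = torch.optim.SGD([coef_a, coef_b], lr=0.1)  # lr chosen arbitrarily
criterion = nn.MSELoss()

def combined_forward(x):
    # Rebuild the combined weights each call so autograd can reach the coefficients.
    params = {k: coef_a * params_a[k] + coef_b * params_b[k] for k in params_a}
    # Run model_a's architecture with the combined weights substituted in.
    return functional_call(model_a, params, (x,))

inputs = torch.randn(32, 4)
target = model_a(inputs).detach()  # toy target: (coef_a, coef_b) = (1, 0) fits exactly

losses = []
for _ in range(200):
    loss = criterion(combined_forward(inputs), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    losses.append(loss.item())
```

Any custom loss function can be dropped in for criterion, as long as it is differentiable with respect to the model output.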

Thanks again!!

“I’m making a note here: huge success.”

I had to write 1875 lines of code just to disassemble and reassemble everything as I wanted and test it step by step, but it worked 😄

Thanks, again!