Creating a model whose weights are the sum of the weights of 2 different neural networks

Hi Bruno!

I don’t have any specific references. I have a vague recollection that
I expressed similar skepticism in another thread on this forum and the
poster ended up replying that he got his scheme working.

Some thoughts:

First, a disclaimer: I haven’t tried any relevant experiments myself,
and I don’t know of any literature on this issue.

But let’s say you want to try this and make it work – that is, you want
to train two models with identical architectures independently on two
similar, but distinct problems, and then average the weights together
somehow.
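
For concreteness, a minimal sketch of what "averaging the weights together"
could look like mechanically (this is untested, and the `Net` class is just a
hypothetical stand-in for whatever identical architecture your two models share):

```python
import copy
import torch
import torch.nn as nn

# hypothetical stand-in for the shared architecture
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, 3)
        self.fc = nn.Linear(8 * 26 * 26, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.fc(x.flatten(1))

model_A = Net()
model_B = Net()
# ... train model_A and model_B independently on the two problems ...

# build a third model whose weights are the element-wise average of the
# corresponding weights of model_A and model_B
model_avg = Net()
avg_state = copy.deepcopy(model_A.state_dict())
state_B = model_B.state_dict()
for name in avg_state:
    avg_state[name] = (avg_state[name] + state_B[name]) / 2
model_avg.load_state_dict(avg_state)
```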

Why might doing something like this have value? My intuition runs as
follows: The upstream weights of a network tend to learn “lower-level”
features that are likely to be common to both problems. For example,
upstream convolutions might learn to detect edges. In contrast, the
downstream weights learn “higher-level” features more tuned to the
specific problem being trained on. For example, even though the two
problems are similar, a downstream fully-connected layer trained on
zip codes might tune itself to handwritten digits, while one trained on
house numbers might do better with block-letter-style printed or
engraved digits.

By “averaging” together the upstream weights, you might do a better
(and perhaps more generalizable) job recognizing the more generic,
lower-level features. (It would seem harder to gain any advantage with
the more-problem-specific features.)

Let’s say this is all true and can be made to work somehow. The
concern I expressed above still remains: After independent training,
because of the redundancies in the network weights, there’s no reason
that weight-17 in model A plays the same role or has the same meaning
as weight-17 in model B. So it doesn’t make sense to combine them
together, using an average or otherwise.

However, what if, by (extremely unlikely) happenstance, model A and
model B, while being trained, followed similar paths and ended up at
similar locations in weight-configuration space? Then it could make
sense to combine the two weight-17s together, as they could now have
similar meanings.

One approach could be to have the two models guide one another
along similar paths while training.

The idea would be that you could start the two models at the same
(random) location – i.e., use the same random initialization for weights
of both models – and then add a loss term that nudges the two models
to prefer similar paths.

Concretely, you could use ((weights_A - weights_B)**2).sum()
as an added loss term for training both models, where, when taking
the optimization step for model A, weights_B are viewed as fixed
(non-trainable) parameters, and vice versa.
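
Continuing with the hypothetical model_A and model_B from above, a minimal
(untested) sketch of such a proximity term, where .detach() plays the role of
treating the other model's weights as fixed, and lambda_prox is just an
illustrative hyperparameter:

```python
lambda_prox = 0.1   # illustrative strength of the proximity term

def proximity_loss(model_X, model_Y):
    # sum of squared differences between corresponding weights,
    # with model_Y's weights treated as fixed (detached)
    return sum(
        ((p_x - p_y.detach()) ** 2).sum()
        for p_x, p_y in zip(model_X.parameters(), model_Y.parameters())
    )

# inside the training loop (sketch):
# loss_A = task_loss_A(model_A(input_A), target_A)
# loss_A = loss_A + lambda_prox * proximity_loss(model_A, model_B)
# loss_A.backward(); optimizer_A.step()
#
# loss_B = task_loss_B(model_B(input_B), target_B)
# loss_B = loss_B + lambda_prox * proximity_loss(model_B, model_A)
# loss_B.backward(); optimizer_B.step()
```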

If you accept the intuition that the upstream weights are more likely
than downstream weights to play similar roles in the two models, you
might choose to weight the upstream weights more heavily in the
proposed added loss term.
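
One (untested) way to express this per-layer weighting would be to give each
named parameter its own coefficient in the proximity term, larger for the
upstream layers and smaller for the downstream ones (the names here match
the hypothetical Net above, and the coefficients are just illustrative):

```python
# illustrative per-layer coefficients for the proximity term
layer_weights = {'conv.weight': 1.0, 'conv.bias': 1.0,
                 'fc.weight': 0.1, 'fc.bias': 0.1}

def weighted_proximity_loss(model_X, model_Y):
    params_Y = dict(model_Y.named_parameters())
    return sum(
        layer_weights[name] * ((p_x - params_Y[name].detach()) ** 2).sum()
        for name, p_x in model_X.named_parameters()
    )
```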

Taking this idea to its logical extreme, you could weight the upstream
weights so heavily that the upstream weights of the two models are
forced to be identical, and give zero weight to the downstream weights
so that they train completely independently.

Of course, doing this would just be doing what people already do when
they train two-headed networks: a common upstream part of the network
feeds two downstream heads that make two sets of predictions for two
different problems; each head feeds its own loss function, and the two
losses are added together into a single loss that is then backpropagated.
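
For reference, a minimal sketch of such a two-headed network (shared trunk,
two heads, summed losses); the layer sizes are placeholders, not a
recommendation:

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # shared upstream trunk
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 8, 3),
            nn.ReLU(),
            nn.Flatten(),
        )
        # two problem-specific downstream heads
        self.head_A = nn.Linear(8 * 26 * 26, 10)
        self.head_B = nn.Linear(8 * 26 * 26, 10)

    def forward(self, x):
        features = self.trunk(x)
        return self.head_A(features), self.head_B(features)

# training sketch: add the two per-problem losses into one loss
# model = TwoHeadedNet()
# pred_A, pred_B = model(input)
# loss = loss_fn_A(pred_A, target_A) + loss_fn_B(pred_B, target_B)
# loss.backward()
```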

From this perspective, using such a (potentially weighted) loss term to
guide the two models along similar paths could be viewed as a way to
interpolate between training the two models fully independently (and
combining the weights together after the fact) and training a single
model with two heads.

Best.

K. Frank
