Average each weight of two models


I have two models trained in exactly the same way, except with different learning rates. I would like to average each weight of every layer and create a new model from the averaged weights. Is that possible?

Thanks in advance

Yes, you could get the state_dicts of both models, average the parameters, and load the averaged state_dict into a new model.
Here is a small dummy example:

# Setup
import torch.nn as nn

modelA = nn.Linear(1, 1)
modelB = nn.Linear(1, 1)

sdA = modelA.state_dict()
sdB = modelB.state_dict()

# Average all parameters (the result is stored in sdB)
for key in sdA:
    sdB[key] = (sdB[key] + sdA[key]) / 2.

# Recreate model and load averaged state_dict (or use modelA/B)
model = nn.Linear(1, 1)
model.load_state_dict(sdB)
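
For models with more than one layer, the same idea can be wrapped in a small helper. The following is only a sketch: `average_state_dicts` and `make_model` are hypothetical names introduced for illustration, and copying integer buffers (such as BatchNorm's `num_batches_tracked`) from the first model instead of averaging them is an assumption, not something the answer above prescribes.

```python
import torch
import torch.nn as nn


def average_state_dicts(sd_a, sd_b):
    """Return a new state_dict holding the element-wise mean of sd_a and sd_b."""
    if set(sd_a) != set(sd_b):
        raise ValueError("state_dicts have different keys")
    averaged = {}
    for key, tensor_a in sd_a.items():
        tensor_b = sd_b[key]
        if tensor_a.is_floating_point():
            averaged[key] = (tensor_a + tensor_b) / 2.0
        else:
            # Assumption: integer buffers (e.g. num_batches_tracked) are
            # copied from the first model rather than averaged.
            averaged[key] = tensor_a.clone()
    return averaged


def make_model():
    # Hypothetical architecture; both models must share the same structure.
    return nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.Linear(8, 1))


model_a, model_b = make_model(), make_model()
model_avg = make_model()
model_avg.load_state_dict(
    average_state_dicts(model_a.state_dict(), model_b.state_dict())
)
```

The helper leaves both input state_dicts untouched, so `model_a` and `model_b` can still be used afterwards.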

Thanks, that works. Thank you so much!

Hello @ptrblck,

Is there a proper way to parallelize the loop over the keys of the state_dict?

Thanks in advance

I guess Python might provide some multiprocessing utilities for this use case. However, how often do you run this code snippet? It seems to be the bottleneck you would like to optimize.


@ptrblck Slightly off-topic: Is there any solid evidence that averaging the weights of different runs has any benefits?

In principle, both runs may converge to two very different minima of the loss. It's not obvious to me why averaging the weights would be a good idea. Admittedly, this might depend on what "trained exactly the same way" means, but even a different order of batches might affect this, wouldn't it?


I use this averaging after each optimizer step. I agree that Python's multiprocessing utilities could be a good idea. However, these utilities generally create a pool of jobs on the CPU and then execute them, and I am not sure this is the best way of doing it. I am wondering whether it is possible to create the job pool directly on the GPU. I suspect it won't make a significant difference, but I am not sure.
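
One way to avoid the Python loop without multiprocessing is to flatten all parameters into a single vector and average them with one tensor operation. This is only a sketch, assuming both models have the same architecture and live on the same device; it uses `parameters_to_vector` and `vector_to_parameters` from `torch.nn.utils`, covers parameters only (not buffers), and pays for extra copies when flattening, so it is not guaranteed to be faster than the loop.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters


def make_model():
    # Hypothetical architecture used only for this illustration.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))


device = "cuda" if torch.cuda.is_available() else "cpu"
model_a = make_model().to(device)
model_b = make_model().to(device)
model_avg = make_model().to(device)

with torch.no_grad():
    # Flatten every parameter of each model into one 1-D tensor.
    vec_a = parameters_to_vector(model_a.parameters())
    vec_b = parameters_to_vector(model_b.parameters())
    # A single element-wise operation replaces the loop over keys.
    vec_avg = (vec_a + vec_b) / 2.0
    # Write the averaged values back into the parameters of model_avg.
    vector_to_parameters(vec_avg, model_avg.parameters())
```

If the tensors already live on the GPU, each addition in the original loop also runs on the GPU; the loop itself only launches the kernels.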

At some point I’ve looked into Stochastic Weight Averaging, which claims that a simple averaging of multiple checkpoints leads to a better generalization.
If I’m not mistaken, there is also an SWA package for PyTorch, which applies this strategy during your training.
That being said, I don’t know if these experiments still hold true for other models than the one mentioned in the paper.
Generally, I would assume that this method only works if all training checkpoints are close to the global minimum, which seems to be the case if you use the checkpoints from a single run (if I remember the paper correctly).
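
As an illustration of averaging checkpoints from a single run, here is a minimal sketch that assumes a PyTorch version shipping `torch.optim.swa_utils` (this may not be the SWA package referred to above). The toy data, the model, and the choice to start averaging at step 3 are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# AveragedModel keeps a running equal-weight average of the parameters.
swa_model = AveragedModel(model)

inputs, targets = torch.randn(32, 4), torch.randn(32, 1)  # toy data
snapshots = []  # kept only to compare against a plain mean afterwards
for step in range(6):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if step >= 3:  # hypothetical choice: average only the late checkpoints
        swa_model.update_parameters(model)
        snapshots.append(model.weight.detach().clone())
```

After the loop, `swa_model.module` holds the averaged weights and can be used for evaluation like a regular module.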

@omarfoq I think this operation might benefit from the current ongoing port of apex' multi_tensor_apply in this PR.
You could have a look at the code changes and check whether you could reuse them.
However, as mentioned before, I would profile the code beforehand and make sure that a speedup would be visible in your case.
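
A rough way to measure the cost of the averaging loop before optimizing it is sketched below. `average_in_place` and `time_averaging` are hypothetical helpers, and the layer size and repeat count are arbitrary; the `torch.cuda.synchronize()` calls are there because GPU kernels run asynchronously, so timing without them would mostly measure kernel launches.

```python
import time

import torch
import torch.nn as nn


def average_in_place(sd_a, sd_b):
    # Same loop as in the dummy example: the mean replaces the entries of sd_b.
    for key in sd_a:
        sd_b[key] = (sd_b[key] + sd_a[key]) / 2.0


def time_averaging(sd_a, sd_b, repeats=100):
    """Return the mean wall-clock seconds per averaging call."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    for _ in range(repeats):
        average_in_place(sd_a, sd_b)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the queued kernels to finish
    return (time.perf_counter() - start) / repeats


device = "cuda" if torch.cuda.is_available() else "cpu"
model_a = nn.Linear(256, 256).to(device)
model_b = nn.Linear(256, 256).to(device)
seconds_per_call = time_averaging(model_a.state_dict(), model_b.state_dict())
print(f"averaging took {seconds_per_call * 1e6:.1f} microseconds per call")
```

Comparing this number with the duration of a full training step shows whether the averaging is actually the bottleneck.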


@ptrblck thanks, good points. Yes, when the weights come from the same run at different checkpoints, I can see that this might be beneficial.