SWA models are not working

Hi,

I’m trying to get SWA (EMA) working for my model, but whatever I try, the resulting model does not work. It runs, technically, but the predictions are simply wrong.

My current approach is to start from my best model as a checkpoint, train for another 20 epochs with a reduced learning rate, and then use the averaged model.

I train a mixed-precision model using DDP. I think the way I calculate the averaged model is actually straightforward:

I was wondering whether the way I serialize the model could be the problem, because the layer keys in the model_state_dict are a bit different from the layer names in the non-averaged model:

non-averaged model: ‘conv_layer0.0.weight’

averaged model: ‘module.conv_layer0.0.weight’

Could that already be the problem?

Happy to get any new insights :slight_smile:

Best,
Thorsten

I think I can narrow down the problem a bit.

The training runs with the old “DataParallel” mode.

However, the “prediction” runs with “DistributedDataParallel”. So far this has not been a problem, but it turns out that running the averaged model with DDP gives me corrupted results, while running the same model with DP gives me decent results.

Any idea what might be causing this?

@ptrblck maybe?

Best,
Thorsten

PS: The topic now might be better placed into the “Distributed” category.

Solved. It was just the “module.” prefix on the state dict keys. I removed it and now it works :slight_smile:
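For anyone who lands here with the same symptom, here is a minimal sketch of what removing the prefix can look like. It is not the exact code from this thread: the helper name `strip_prefix`, the `conv_layer0.0.bias` key, and the placeholder values are made up for illustration, and in a real checkpoint the values would be tensors.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it.

    Keys that do not start with the prefix are copied unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for the averaged model's state dict (hypothetical values;
# a real state dict maps these keys to tensors).
averaged_state = OrderedDict([
    ("module.conv_layer0.0.weight", [0.1, 0.2]),
    ("module.conv_layer0.0.bias", [0.0]),
])

cleaned_state = strip_prefix(averaged_state)
print(list(cleaned_state))

# With a real checkpoint, the cleaned dict would then be passed to the
# plain (unwrapped) model, e.g. model.load_state_dict(cleaned_state).
```

The helper builds a new dict instead of renaming keys in place, so the original checkpoint stays usable for whichever wrapper expects the prefixed names.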
