SWA AveragedModel proper usage

Mario_Parreno · November 5, 2020, 2:37pm

Hi, I want to use the SWA technique, and it is finally official in PyTorch! https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/

But one thing is not clear. In the example they post, the AveragedModel is created outside the training loop, before training starts. My question is: if we want to average starting from the latest model weights, as we should, and not from the initial random weights, shouldn't we create the AveragedModel just before SWA starts? Something like this (following the example's nomenclature):

if (epoch+1) == swa_start:
    swa_model = AveragedModel(model)

Mario_Parreno · January 29, 2021, 11:07am

Up… I haven't found a solution. Modifying the example from the page, I think it should look like this:

import torch
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

loader, optimizer, model, loss_fn = ...
swa_model = None  # created lazily, once the SWA phase begins
scheduler = CosineAnnealingLR(optimizer, T_max=100)
swa_start = 5
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(100):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        if swa_model is None:
            # Snapshot the current (already trained) weights
            swa_model = AveragedModel(model)
        # Update on every SWA epoch, including the one that creates swa_model,
        # so that no snapshot and no scheduler step is skipped
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data
preds = swa_model(test_input)

jmaronas · January 29, 2021, 11:12am

Perhaps @ptrblck can help with this?

ptrblck · January 29, 2021, 11:18am

I think your code is correct, and the initial “checkpoint” would be created after swa_start epochs were already done. Afterwards the update would take place.
I’m unsure whether this is “necessary” or whether the posted example would also work fine, since the initial updates would become less important as training progresses.

Did you run your code and see a difference between the two approaches?
If so, which one performed better?
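
The remark that early updates matter less over time can be illustrated with plain arithmetic. The sketch below assumes the equal-weight running mean that AveragedModel uses when no custom avg_fn is given; the function name running_average and the scalar "weights" are hypothetical stand-ins, not part of PyTorch.

```python
def running_average(snapshots):
    """Equal-weight running mean, updated one snapshot at a time."""
    avg = None
    n_averaged = 0
    for w in snapshots:
        if n_averaged == 0:
            avg = w  # the first snapshot is taken as-is
        else:
            avg = avg + (w - avg) / (n_averaged + 1)
        n_averaged += 1
    return avg

# One poor early snapshot (10.0) followed by snapshots near a good value (1.0)
few = running_average([10.0] + [1.0] * 4)    # 5 snapshots in total
many = running_average([10.0] + [1.0] * 99)  # 100 snapshots in total
print(few, many)
```

With 5 snapshots the early value still carries a weight of 1/5; with 100 snapshots its weight drops to 1/100, so its influence shrinks but never disappears entirely.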

DDavid · February 12, 2021, 1:50pm

I did the same as you, and it seems you are right.
Right after swa_model = AveragedModel(model), I checked the weights of the swa_model parameters and the model parameters, and they are exactly the same. Doing swa_model = AveragedModel(model) before the loop starts is not a good idea.
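
A check along these lines can be reproduced with a toy model. This is a minimal sketch, assuming an nn.Linear layer as a stand-in for a real network; same_weights is a hypothetical helper. It also looks at what the first update_parameters call does. In the PyTorch releases this was checked against, that first call copies the current weights instead of averaging them, which would mean the construction-time weights do not enter the average even when swa_model is created before the loop. It is worth confirming on your own version.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel

torch.manual_seed(0)
model = nn.Linear(4, 2)           # toy stand-in for a real network
swa_model = AveragedModel(model)  # deep copy of model made at construction

def same_weights(a, b):
    """True if both modules hold element-wise identical parameters."""
    return all(torch.equal(p, q) for p, q in zip(a.parameters(), b.parameters()))

copied_at_init = same_weights(swa_model.module, model)

# Pretend that training moved the weights of the live model
with torch.no_grad():
    for p in model.parameters():
        p.add_(1.0)
differs_after_training = not same_weights(swa_model.module, model)

# First update: n_averaged goes from 0 to 1
swa_model.update_parameters(model)
matches_after_first_update = same_weights(swa_model.module, model)
n_averaged = int(swa_model.n_averaged)

print(copied_at_init, differs_after_training, matches_after_first_update, n_averaged)
```

If matches_after_first_update is True on your installation, both placements of AveragedModel(model) start the average from the first snapshot taken after swa_start.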

hktxt · June 15, 2021, 8:01am

How would I insert the validation phase into this code?

mikolchon · August 6, 2021, 3:35pm

@Mario_Parreno @ptrblck
Do we also need to update the BN stats before validation?
Given that we do need it for testing, I suspect we also need it for validation, correct?

ptrblck · August 8, 2021, 6:29am

The batchnorm stats will be updated by default during training (the model is in training mode by default, or you can explicitly call model.train()), while these running stats will be used during validation after calling model.eval(). I’m unsure how you would like to update these stats, so could you explain the use case and the question a bit more?

mikolchon · August 8, 2021, 10:54am

Sure! So in SWA two models are maintained: the model and the swa_model. The latter is the averaged model. What we “train” is model (we backprop through it, update its weights, etc.), and only every now and then do we update swa_model with model by averaging. That’s why the batchnorm stats in swa_model need separate updating. From the PyTorch website:

One important detail is the batch normalization. Batch normalization layers compute running statistics of activations during training. Note that the SWA averages of the weights are never used to make predictions during training. So the batch normalization layers do not have the activation statistics computed at the end of training. We can compute these statistics by doing a single forward pass on the train data with the SWA model.

And here’s the code snippet provided by PyTorch, where we see the BN stats of swa_model being updated at the end, right before testing:

for epoch in range(100):
      for input, target in loader:
          optimizer.zero_grad()
          loss_fn(model(input), target).backward()
          optimizer.step()
      if epoch > swa_start:
          swa_model.update_parameters(model)
          swa_scheduler.step()
      else:
          scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data 
preds = swa_model(test_input)

The code snippet doesn’t show the validation step. So my question was whether we should also call update_bn() on the swa_model before validating it.

ptrblck · August 8, 2021, 10:04pm

Thanks for clarifying the question!
Using the provided code snippet to update the batchnorm stats using the training dataset sounds right. I wouldn’t update the stats using the validation dataset, as I would consider it a data leak (similar to updating them without the SWA util).

mikolchon · August 9, 2021, 11:10am

Yes, I wouldn’t use the validation dataset. What I meant is to update the BN stats using the training set before running validation, similar to the code example above (loader is the training set).

OK, so looking at swa/train.py at commit 411b2fcad59bec60c6c9eb1eb19ab906540e5ea2 of timgaripov/swa on GitHub, I think I was right that we need to update the BN stats of the swa_model before we validate. The relevant lines are 158 and 159. @hktxt @ptrblck
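
The pattern discussed here (recompute the BN stats of swa_model on the training set, then validate) could be sketched as follows. This is a minimal illustration and not the referenced repository's code: the synthetic data, the toy network, the validate helper, and names such as train_loader and val_loader are all assumptions made for the example.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

torch.manual_seed(0)
# Synthetic stand-ins for real training and validation data
train_loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
val_loader = DataLoader(TensorDataset(torch.randn(32, 8), torch.randn(32, 1)), batch_size=16)

model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
swa_start = 2

def validate(net, loader):
    """Mean loss over a loader, in eval mode and without gradients."""
    net.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(net(x), y).item() * len(x)
            count += len(x)
    return total / count

val_losses = []
for epoch in range(5):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
        # Recompute BN running stats for the averaged weights,
        # using TRAINING data only (never the validation set)
        update_bn(train_loader, swa_model)
        val_losses.append(validate(swa_model, val_loader))

print(val_losses)
```

Each validation of swa_model here costs one extra forward pass over the training loader, because update_bn resets the running stats and recomputes them from scratch.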

chenglu · January 24, 2023, 8:15am

I was wondering: if we update the BN stats with the validation set, will the performance be better? If so, should we also apply update_bn on the test set?

Fredrik_Opeide · February 8, 2024, 3:25pm

So if we want to compute validation metrics after each training epoch, we need to do an entire second pass through our training data? After every epoch? That can add up to a lot of extra time. Are there any recommended alternative methods?
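
Two possible ways to reduce that cost are sketched below. Neither is an official recommendation: estimating BN stats from a fixed subset of the training data gives noisier statistics, and validating the averaged model only every few epochs gives coarser monitoring, so both should be checked against a full pass on your own task. The dataset, the subset size, and the values of val_every, swa_start and num_epochs are hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset
from torch.optim.swa_utils import AveragedModel, update_bn

torch.manual_seed(0)
train_dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))  # synthetic
model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1))
swa_model = AveragedModel(model)
swa_model.update_parameters(model)

# Option 1: estimate BN stats from a fixed subset of the training set
bn_loader = DataLoader(Subset(train_dataset, range(64)), batch_size=16)
update_bn(bn_loader, swa_model)
batches_seen = int(swa_model.module[1].num_batches_tracked)

# Option 2: validate the averaged model only every `val_every` epochs
val_every, swa_start, num_epochs = 5, 75, 100
swa_val_epochs = [
    epoch for epoch in range(num_epochs)
    if epoch > swa_start and (epoch + 1) % val_every == 0
]
print(batches_seen, swa_val_epochs)
```

The two options can be combined, and a full update_bn pass over the whole training set is still needed once at the end, before final evaluation.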