SWA training question


I’m trying to use the SWA code in PyTorch 1.6 (https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/), following a structure similar to the sample in the blog:

import torch
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

loader, optimizer, model, loss_fn = ...
swa_model = AveragedModel(model)  # separate running average of the weights
scheduler = CosineAnnealingLR(optimizer, T_max=100)
swa_start = 5
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(100):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        # SWA regime: fold the current weights into the average
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)

However, I’m not certain about a couple of things. After entering the SWA regime with the SWA scheduler learning rate set to the value from the blog (swa_lr=0.05), the model in training quickly becomes unstable (NaN). If I lower it to 0.001, it appears to work (at least there are no NaNs).

Is the SWA copy of the model supposed to affect the model in training? (I had understood that the SWA copy is updated separately from the model in training.)
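One way to reason about this question is to look at what the averaging step does. As far as the documented behaviour goes, AveragedModel holds its own copy of the parameters, and update_parameters only reads from the training model, folding its current weights into an equal-weight running mean. The sketch below mimics that with plain Python lists instead of tensors; update_average is a hypothetical helper for illustration, not the library's API, and the default equal-weight averaging is assumed.

```python
import copy

def update_average(avg_weights, model_weights, num_averaged):
    """Return the running mean after folding in model_weights.

    Hypothetical stand-in for the equal-weight update; it only reads
    model_weights and never writes to it.
    """
    if num_averaged == 0:
        # The first update simply copies the current weights.
        return copy.deepcopy(model_weights)
    return [a + (w - a) / (num_averaged + 1)
            for a, w in zip(avg_weights, model_weights)]

# Pretend these are the weights of the training model after three epochs.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

avg_weights, num_averaged = [], 0
for model_weights in snapshots:
    avg_weights = update_average(avg_weights, model_weights, num_averaged)
    num_averaged += 1

print("training weights:", model_weights)
print("averaged weights:", avg_weights)
```

Under these assumptions the dependency goes one way only: the average is a function of the training weights, so the averaging step by itself should not be able to destabilise the model in training.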

I would also assume that the SWALR object does not interfere with the standard training routine, yet it seems an additional swa_lr entry is created in the optimizer's param_groups, as seen here.
However, I’m currently unsure where this swa_lr is used, since the model update doesn’t seem to use it here.
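For context on where swa_lr may come into play: SWALR is a learning rate scheduler, so each swa_scheduler.step() overwrites the lr entry of every param group, annealing it from its current value toward swa_lr over anneal_epochs (10 by default, cosine by default). The optimizer itself only reads lr; swa_lr is the stored target. The sketch below reproduces that interpolation in plain Python. annealed_lr is a hypothetical helper, and the base rate of 0.001 is an assumption made for illustration, not a value from the training run above.

```python
import math

def annealed_lr(base_lr, swa_lr, step, anneal_epochs=10):
    """Cosine interpolation from base_lr to swa_lr (hypothetical helper).

    step counts calls to the SWA scheduler since the SWA regime started.
    """
    t = max(0.0, min(1.0, step / max(1, anneal_epochs)))
    alpha = (1 - math.cos(math.pi * t)) / 2  # goes from 0 to 1
    return swa_lr * alpha + base_lr * (1 - alpha)

base_lr, swa_lr = 0.001, 0.05  # base_lr is an assumed value
schedule = [annealed_lr(base_lr, swa_lr, s) for s in range(13)]
for s, lr in enumerate(schedule):
    print(f"swa step {s:2d}: lr = {lr:.4f}")
```

If the rate used before swa_start is much smaller than 0.05, this schedule raises the optimizer's step size considerably within a few epochs, which may be what makes training diverge. Printing optimizer.param_groups[0]['lr'] once per epoch in the real loop would confirm whether that is happening.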

Hello @ptrblck! I’m still unsure about the logic, but I’ll continue testing. Cheers!