I’m trying to use the SWA code in PyTorch 1.6 (https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/), following a structure similar to the sample in the blog:
```python
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

loader, optimizer, model, loss_fn = ...
swa_model = AveragedModel(model)
scheduler = CosineAnnealingLR(optimizer, T_max=100)
swa_start = 5
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(100):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
```
However, I’m not certain about a couple of things. After entering the SWA regime, with the learning rate from the example (`swa_lr=0.05`), the model being trained quickly becomes unstable (the loss goes to NaN); if I lower it to 0.001, training appears to work (at least there are no NaNs).
Is it expected that the SWA copy of the model affects the model being trained? (I had understood that the SWA copy is updated separately from the model being trained.)
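For context, here is a minimal sketch (using a toy `Linear` model of my own, not the actual training code) of what I believe the two components do: `AveragedModel` deep-copies the weights at construction and only syncs when `update_parameters()` is called, while `SWALR` only changes the learning rate of the optimizer driving the live model. If that's right, the instability would come from the lr jumping to `swa_lr`, not from the SWA copy feeding back into training:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

# Toy model for illustration only
model = torch.nn.Linear(4, 2)
swa_model = AveragedModel(model)  # deep-copies the current weights

# SWALR changes the learning rate of the optimizer driving `model`;
# it does not touch swa_model at all.
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=1)
swa_scheduler.step()
print(optimizer.param_groups[0]["lr"])  # annealed to the SWA learning rate

# Changing the live model does NOT change the SWA copy until
# update_parameters() is called explicitly.
with torch.no_grad():
    model.weight.add_(1.0)
same = torch.equal(swa_model.module.weight, model.weight)
print(same)  # the copies are independent
```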