Adam optimizer is said to use adaptive learning rates for each parameter. We provide an initial learning rate when instantiating it.
optimizer = optim.Adam(params, lr=4e-6) scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5) for epoch in range(100): # ... scheduler.step()
I wonder what exactly happens when we call scheduler step. I “guess”, for the above example, it reduces the learning rates of each and every parameter by half. Is this right?
Alternative would be to halve the global learning rate (4e-6), resetting per parameter learning rates and re-estimating them.
Which one is used in PyTorch?
I usually observe a drop in validation accuracy right after scheduler.step takes affect, when the network is close to convergence (it may also be happening at the beginning, but not noticeable due to large leaps in accuracy).