Loss jumps abruptly whenever learning rate is decayed in Adam optimizer

I’m training an auto-encoder network with the Adam optimizer (with amsgrad=True) and MSE loss for a single-channel audio source separation task. Whenever I decay the learning rate by a factor, the network loss jumps abruptly and then decreases until the next learning-rate decay.

I’m using PyTorch for the network implementation and training.

Following are my experimental setups:

 Setup-1: NO learning rate decay, and 
          Using the same Adam optimizer for all epochs

 Setup-2: NO learning rate decay, and 
          Creating a new Adam optimizer with same initial values every epoch

 Setup-3: 0.25 decay in learning rate every 25 epochs, and
          Creating a new Adam optimizer every epoch

 Setup-4: 0.25 decay in learning rate every 25 epochs, and
          NOT creating a new Adam optimizer every time; instead,
          using PyTorch's MultiStepLR scheduler to decay every 25 epochs

I am getting very surprising results for Setups #2, #3, and #4 and cannot explain them. Here are my results:

Setup-1 Results:

Here I'm NOT decaying the learning rate and
I'm using the same Adam optimizer throughout. So my results are as expected:
the loss decreases as training progresses.
Below is the loss plot for this setup.

[Setup-1 loss plot]

optimizer = torch.optim.Adam(lr=m_lr,amsgrad=True, ...........)

for epoch in range(num_epochs):
    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

Setup-2 Results:  

Here I'm NOT decaying the learning rate, but every epoch I'm creating a new
Adam optimizer with the same initial parameters.
The results show behavior similar to Setup-1.

Because a new Adam optimizer is created every epoch, its state (the running
moment estimates for each parameter) should be lost, yet this does not seem
to affect the network's learning. Can anyone please help explain this?
(A small check of what gets discarded is sketched after the code below.)

[Setup-2 loss plot]

for epoch in range(num_epochs):
    optimizer = torch.optim.Adam(lr=m_lr,amsgrad=True, ...........)

    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)
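
For reference, a minimal sketch (with a hypothetical stand-in model, for illustration only) of how one can confirm that re-creating the optimizer discards Adam's per-parameter state:

import torch

model = torch.nn.Linear(4, 2)                               # stand-in model, not my actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
print(len(optimizer.state))                                 # 0 -> no moment estimates yet

model(torch.randn(8, 4)).pow(2).mean().backward()           # one dummy update
optimizer.step()
print(len(optimizer.state))                                 # one state entry per parameter now

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
print(len(optimizer.state))                                 # 0 again -> moment estimates discarded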

Setup-3 Results: 

As can be seen in the plot below,
my loss jumps every time I decay the learning rate. This is weird behavior.

If it were caused by creating a new Adam optimizer every epoch, then it should
have happened in Setup #2 as well (which also creates a new optimizer every
epoch). And if it were caused by creating a new Adam optimizer with a new
learning rate (alpha) every 25 epochs, then the results of Setup #4 below also
rule out that explanation.

[Setup-3 loss plot]

decay_rate = 0.25
for epoch in range(num_epochs):
    optimizer = torch.optim.Adam(lr=m_lr,amsgrad=True, ...........)
    
    if epoch % 25 == 0  and epoch != 0:
        lr *= decay_rate   # decay the learning rate

    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

Setup-4 Results:  

In this setup, I'm using PyTorch's learning-rate-decay scheduler (MultiStepLR),
which decays the learning rate by 0.25 every 25 epochs.
Here as well, the loss jumps every time the learning rate is decayed.

I don't understand the reason behind this behaviour.

[Setup-4 loss plot]

optimizer = torch.optim.Adam(lr=m_lr,amsgrad=True, ...........)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer=optimizer, milestones=[25,50,75], gamma=0.25)

for epoch in range(num_epochs):

    scheduler.step()
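    # note: PyTorch >= 1.1 expects scheduler.step() to be called after optimizer.step(),
    # i.e. at the end of the epoch; calling it at the top of the loop follows the older convention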

    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

I’m not able to understand the reason for the sudden jumps in the loss whenever I decay the learning rate.

EDIT: As suggested in the comments and reply below, I’ve made changes to my code and trained the model. I’ve added the code and plots for the same.

Any help would be appreciated.
Thanks

Hi, I do not fully understand the problem either. However, here are some thoughts on your problem:

  • Your loss decreases without explicit learning rate decay. Is there a particular reason you want to get learning rate decay working?
  • Adam uses adaptive learning rates intrinsically (see the sketch right after this list). I guess for many problems that should be good enough. You can read more on this in this discussion on Stackoverflow.
  • Adam (like many other common optimization algorithms) adapts to a specific machine learning problem by computing/estimating momenta. Creating a new optimizer every epoch should therefore degrade performance due to the loss of that information.
  • I feel like decreasing the learning rate by 75% might be too much when using a momentum-based optimizer. It would be interesting to see whether reducing the learning rate by something like 15–25% gives better results.
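
To illustrate the point about momenta, here is a rough, self-contained sketch of one Adam update for a single parameter, following the notation of the Adam paper (the hyperparameter values are just the usual defaults). The moment estimates m and v adapt the step per parameter, but the global step size lr still multiplies every update, so changing lr rescales all steps at once:

import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the iteration counter, starting at 1
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=0.5, grad=0.1, m=0.0, v=0.0, t=1)   # toy usage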

Hi @Florian_1990,
As stated in the original Adam paper, Adam computes an adaptive learning rate for each parameter and updates it at every iteration. So, inherently, I should NOT be able to use a learning rate scheduler such as lr_scheduler.MultiStepLR() on optim.Adam() with a new learning rate.

My doubts:

  1. In Setup-3 I’m intentionally creating a new Adam optimizer with a decayed learning rate every 25 epochs, but in Setup-4 I’m using the available lr_scheduler.MultiStepLR() to decay the learning rate every 25 epochs. Surprisingly, the behavior in both of these settings is the same, which I’m not able to understand.

  2. Normally, PyTorch should throw a warning/error when using an lr_scheduler on optim.Adam(), stating that external LR decay creates inconsistent behavior, or something along those lines. (See the sketch of what MultiStepLR actually changes, right after this list.)
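
Here is a minimal, hypothetical sketch (stand-in model, made-up sizes) of what MultiStepLR appears to do to an Adam optimizer: it only rescales the base learning rate stored in optimizer.param_groups, while the per-parameter moment estimates in optimizer.state are left in place and keep being used:

import torch

model = torch.nn.Linear(4, 2)                                # stand-in model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25, 50, 75], gamma=0.25)

model(torch.randn(8, 4)).pow(2).mean().backward()            # one dummy update so Adam builds its state
optimizer.step()

print(optimizer.param_groups[0]['lr'])                       # 1e-3: the base learning rate the scheduler rescales
print(len(optimizer.state))                                  # one state entry (exp_avg, exp_avg_sq) per parameter
for epoch in range(30):
    scheduler.step()
print(optimizer.param_groups[0]['lr'])                       # 2.5e-4 after passing the epoch-25 milestone
print(len(optimizer.state))                                  # unchanged: the scheduler never touches the momenta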

My goal is to learn faster when the loss starts saturating (while using Adam), and decaying the learning rate is one of the usual solutions.
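
If the goal is specifically to decay when the loss saturates, one alternative sketch (not what I used in the setups above; stand-in model and dummy loss only) is ReduceLROnPlateau, which watches a metric instead of fixed milestones:

import torch

model = torch.nn.Linear(4, 2)                                # stand-in model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.25, patience=10)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()            # dummy loss standing in for the epoch's average MSE
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())                              # decays lr by 0.25 if the loss hasn't improved for 10 epochs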

Thanks

Hi,

Any help please.
Any insight about what I’m doing wrong would be very helpful.

Thanks

I think the LR decay was not actually applied in Setup-3, and the accumulated momenta weren’t used.

Also, Setup-1 worked well without LR decay, at least as far as I can see,
so I think it’s too early to decay the LR.

This is purely based on my intuition and may include mistakes.
Hope this helps.

I agree. That’s why, in Setup-4, I took care to decay the learning rate externally while keeping the accumulated momentum. But I still do not understand the results of Setup-4.
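
By "decaying the learning rate externally while keeping the accumulated momentum" I mean, roughly, the in-place rescaling below, which as far as I understand is also what MultiStepLR does at a milestone:

for param_group in optimizer.param_groups:
    param_group['lr'] *= 0.25    # rescale only the base learning rate; the accumulated momenta stay in optimizer.state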

Yes, the learning curve of Setup-4 looks weird to me too.
But I suspect the timing of the LR decay, because Setup-1 looks very good without LR decay; I mean, there is no plateau and no hint of overfitting.

So how about training your model longer, e.g. 100 or 200 epochs, and then applying LR decay?

I trained for more epochs and got the results below. The network does not hit a loss plateau in Setup-1, but running that many epochs takes a lot of time.

I ran 200 and 999 epochs and the results are in the graphs below. The 999-epoch run took 7 days on a Quadro P4000.

It is in pursuit of faster convergence that I am doing learning rate decay.
In the process, I have become confused by the behaviour of lr_scheduler on the Adam optimizer. I’m curious whether there is an implementation fault in lr_scheduler for Adam. Should optim.Adam allow an lr_scheduler at all?

Oh, sorry to hear that.

IMO, for faster convergence, use a larger initial learning rate and then decay the LR.
For example, 1e-3 as the initial LR.