How to implement torch.optim.lr_scheduler.CosineAnnealingLR?

Hi @ptrblck, thanks for your response. I am not sure exactly which post I was looking at (I think this one comment by @k0pch4 : ), but I am a little confused about how to implement my solution. Consider a scenario where I want to train for 50 epochs with 10 cycles of cosine annealing (i.e. the lr decreases for 5 epochs and then needs to be brought back up to its maximum again). When you say to call scheduler.step() inside the train batch loop, doesn't that decrease the lr at every batch iteration rather than once per epoch? And if it does update at every batch iteration, does that mean the T_max used in the cosine formula from the documentation needs to be multiplied by the number of batches in the train loader? I am pretty confused here; it would be of great help if someone could clear this up.

This is the documentation that I am referring to: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#CosineAnnealingLR
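For what it's worth, here is a minimal sketch of the per-batch case (the model, the loop bounds and the batches_per_epoch value are assumptions for illustration, not taken from this thread). CosineAnnealingLR simply counts scheduler.step() calls, so if you step it once per batch, T_max has to be expressed in batch iterations, i.e. multiplied by the number of batches per epoch:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

num_epochs = 50
epochs_per_cycle = 5        # e.g. 10 cycles over 50 epochs
batches_per_epoch = 100     # i.e. len(train_loader); assumed value

# T_max is counted in scheduler.step() calls, so for a per-batch step it
# must be given in batches, not epochs.
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs_per_cycle * batches_per_epoch)

for epoch in range(num_epochs):
    for batch_idx in range(batches_per_epoch):  # stand-in for the DataLoader loop
        # ... forward pass, loss.backward() would go here ...
        optimizer.step()
        scheduler.step()    # stepping once per batch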

What does the solution look like now? When I tried to do this, I was told that since version 1.1.0 you have to call optimizer.step() first. Does that mean I should call optimizer.step() inside the dataloader loop and the scheduler after it?

Hi @ptrblck,

Sorry to take up your time. I want to use scheduler = MultiStepLR(optimizer, milestones=[30, 80, 100, 150], gamma=0.1). I would appreciate it if you could tell me whether this implementation is correct:

import torch.optim as optim
scheduler= optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30,80,100,150], gamma=0.1)
lr1=0.0002
lr2=0.0002
optimizerD = optim.Adam(netD.parameters(), lr=lr1, betas=(beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=lr2, betas=(beta1, 0.999))

for epoch in range(num_epochs):
    # (1) Update D network:
    netD.zero_grad()
    scheduler.step()
    netD=netD.float()
    output = netD(real_cpu).view(-1)
    errD_real = criterion(output, label)
    errD_real.backward()

    noise = torch.randn(b_size, nz, 1, 1, 1, device=device)
    # Generate fake image batch with G
    netG=netG.float()
    fake = netG(noise).to(device)
    output = netD(fake.detach()).view(-1)
    errD_fake = criterion(output, label)
    errD_fake.backward()
    # Update D
    optimizerD.step()
    
    # (2) Update G network
    ###########################
    netG.zero_grad()
    label.fill_(real_label) 
    output = netD(fake).view(-1)
    # Calculate G's loss based on this output
    errG = criterion(output, label)
    errG.backward()
    # Update G
    optimizerG.step()

Since PyTorch 1.1.0 you are supposed to call scheduler.step() after the optimizer.step() operation, so you should move it to the end of your loop.

Unrelated to the usage of the scheduler, but you don’t need to call net = net.float() in each iteration.
Call this once (if necessary) before starting the loop.

Many thanks for your reply.
Does that mean calling it once after optimizerD.step() and another time after optimizerG.step()?

You should run it once after the optimizer.step().
Note that the learning rate scheduler works on one optimizer (the one you used while creating the scheduler).
In your current code snippet optimizerD and optimizerG are not using a scheduler.
If you want to change the learning rate of these two optimizers, create two separate schedulers and pass each optimizer to its own scheduler when creating them.

Is it right now?

import torch.optim as optim

lr1=0.0002
lr2=0.0002
optimizerD = optim.Adam(netD.parameters(), lr=lr1, betas=(beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=lr2, betas=(beta1, 0.999))
schedulerD= optim.lr_scheduler.MultiStepLR(optimizerD, milestones=[30,80,100,150], gamma=0.1)
schedulerG= optim.lr_scheduler.MultiStepLR(optimizerG, milestones=[30,80,100,150], gamma=0.1)

for epoch in range(num_epochs):
    # (1) Update D network:
    netD.zero_grad()
    netD=netD.float()
    output = netD(real_cpu).view(-1)
    errD_real = criterion(output, label)
    errD_real.backward()

    noise = torch.randn(b_size, nz, 1, 1, 1, device=device)
    # Generate fake image batch with G
    netG=netG.float()
    fake = netG(noise).to(device)
    output = netD(fake.detach()).view(-1)
    errD_fake = criterion(output, label)
    errD_fake.backward()
    # Update D
    optimizerD.step()
    schedulerD.step()

    # (2) Update G network
    ###########################
    netG.zero_grad()
    label.fill_(real_label) 
    output = netD(fake).view(-1)
    # Calculate G's loss based on this output
    errG = criterion(output, label)
    errG.backward()
    # Update G
    optimizerG.step()
    schedulerG.step()

Yes, now it looks correct.

many many thanks for your help :slight_smile:

@ptrblck If I don’t want to restart would the following be the right approach?

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=1.)
steps = 10
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, steps)

for epoch in range(5):
    for idx in range(steps):  # stands in for iterating the DataLoader
        optimizer.step()

    scheduler.step()

Is restarting at the end of each epoch the recommended way to restart? Or are there other popular alternatives?

Yes, your approach would be right assuming the range(steps) loop is iterating the DataLoader for an epoch, and it would thus match the examples given in the docs.

It would depend on your desired learning rate scheduling and when the learning rate should be changed.


I think for me the main takeaway (I train meta-learning algorithms, for example) is that "an epoch is not clearly defined" there. If we work in terms of iterations, it just depends on how often we call the scheduler and how often we want to restart it, and that is up to us. In summary, my main takeaways:

  • Make sure the decay matches however often you call the scheduler, i.e. set T_max = total_steps / scheduler_call_freq (this is simpler in supervised learning (SL): just step once per epoch and set T_max = num_epochs).

  • Restart the scheduler at the end of T_max; in SL, T_max is usually the end of an epoch. Do that by creating a new scheduler at that point (a rough sketch is shown below). Code for that here: How to implement torch.optim.lr_scheduler.CosineAnnealingLR? - #6 by ptrblck

For me, I usually do something like 1 epoch ~ 2000 steps and "20 epochs" ~ 2k * 25, or something like that.
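For concreteness, here is a rough sketch of that restart-by-recreation idea (this is a toy example rather than the linked post; resetting the lr before building the new scheduler is an extra precaution, since CosineAnnealingLR otherwise continues from the optimizer's current, already-decayed lr):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
base_lr = 1.
optimizer = optim.SGD(model.parameters(), lr=base_lr)
steps_per_epoch = 10   # assumed; think of it as len(train_loader)

for epoch in range(5):
    # Reset the lr to its base value, then create a fresh scheduler so the
    # cosine cycle starts from base_lr again instead of the decayed value.
    for group in optimizer.param_groups:
        group['lr'] = base_lr
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps_per_epoch)

    for idx in range(steps_per_epoch):
        optimizer.step()
        scheduler.step()   # anneals from base_lr down to eta_min within this epoch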

Thanks! :slight_smile:


Or use CosineAnnealingWarmRestarts()?
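For reference, a minimal sketch of that alternative (the T_0 and T_mult values are arbitrary examples): CosineAnnealingWarmRestarts performs the restarts internally, so no manual re-creation of the scheduler is needed.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=1.)

# T_0 is the number of scheduler steps in the first cycle; T_mult=1 keeps
# every cycle the same length.
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=1)

steps_per_epoch = 10
for epoch in range(5):
    for idx in range(steps_per_epoch):
        optimizer.step()
        scheduler.step()   # the lr jumps back to its base value at every restart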