Gradient accumulation and scheduler

I’m using gradient accumulation together with torch.optim.lr_scheduler.CyclicLR.
Is there anything special to consider when combining the two?
scheduler.step() should come after optimizer.step(), but with gradient accumulation, optimizer.step() is only called once every several batches, so stepping the scheduler on every batch would break that ordering.
Thank you

Hello,
I was wondering the same thing with respect to Hugging Face transformers’ schedulers. One of the library’s examples addresses this in the scheduler constructor by dividing the “pre-accumulation” number of training steps by gradient_accumulation_steps:

t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)

They then call scheduler.step() immediately after each optimizer.step().
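
In loop form, that looks roughly like this (a minimal sketch; model, optimizer, scheduler, and the args fields are placeholders matching the snippet above, and compute_loss is a hypothetical helper, not part of the library):

for epoch in range(args.num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        loss = compute_loss(model, batch)  # hypothetical loss helper
        # Scale the loss so the accumulated gradient averages over the micro-batches
        (loss / args.gradient_accumulation_steps).backward()
        if (step + 1) % args.gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()  # step the scheduler only after a real optimizer update
            optimizer.zero_grad()

This way the scheduler advances exactly once per optimizer update, which is why t_total is computed in “post-accumulation” steps.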

In your case with CyclicLR, I think you’ll want to divide your original step_size_up and step_size_down by gradient_accumulation_steps.
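
For example, something like this (a sketch with illustrative numbers; accum_steps stands in for your gradient_accumulation_steps):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model
accum_steps = 4                 # illustrative gradient_accumulation_steps

# CyclicLR cycles momentum by default, so the optimizer needs a momentum parameter
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,
    max_lr=1e-2,
    step_size_up=2000 // accum_steps,    # original step_size_up divided by accum_steps
    step_size_down=2000 // accum_steps,  # same for step_size_down
)

That keeps each cycle the same length when measured in actual optimizer updates.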


Thank you! That makes sense, and it matches what’s implemented in the fastai lr_finder: they train a number of batches equal to the number of gradient accumulation steps, and only then call scheduler.step().
