lr_scheduler.OneCycleLR causing 'Tried to step 57082 times. The specified number of total steps is 57080'

I’m using lr_scheduler.OneCycleLR with 28540 training samples. As an experiment, I run two training passes over the data and one test pass per epoch.

optimizer = optim.AdamW(model.parameters(), hparams['learning_rate'])
criterion = nn.CTCLoss(blank=28).to(device)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=hparams['learning_rate'],
                                          steps_per_epoch=int(len(train_loader)),
                                          epochs=hparams['epochs'],
                                          anneal_strategy='linear')

for epoch in range(0, epochs):
    train(model, device, train_loader, criterion, optimizer, scheduler, text_transform, epoch)
    FeaScatter_Train(model, device, train_loader, criterion, optimizer, scheduler, 0.008, 1, 0.008, text_transform, epoch)
    best_wer = test(model, device, test_loader, criterion, epoch, text_transform, 'standard_')

Inside both train methods I step the scheduler in the same way. Here I am showing only one of them:

model.train()
train_loss = 0
iterator = tqdm(train_loader)

for batch_idx, _data in enumerate(iterator):
    spectrograms, labels, input_lengths, label_lengths = _data
    spectrograms, labels, model = spectrograms.to(device), labels.to(device), model.to(device)

    optimizer.zero_grad()

    output = model(spectrograms)  # (batch, time, n_class)
    output = F.log_softmax(output, dim=2)
    output = output.transpose(0, 1)  # (time, batch, n_class)

    loss = criterion(output, labels, input_lengths, label_lengths).to(device)
    loss.backward()

    optimizer.step()
    scheduler.step()
    train_loss += loss.item()

I can see there is a relation between the number of training samples and the number of scheduler steps, but I can’t seem to solve it.

I cannot reproduce this issue inside the training loop:

import torch
from torch import optim
from torch.utils.data import DataLoader

optimizer = torch.optim.SGD([torch.randn(1, requires_grad=True)], lr=1.)
train_loader = DataLoader(torch.randn(100, 1))

# total_steps = steps_per_epoch * epochs = 100 * 1
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1.,
                                          steps_per_epoch=int(len(train_loader)),
                                          epochs=1,
                                          anneal_strategy='linear')

for data in train_loader:
    optimizer.step()
    scheduler.step()

scheduler.step() # raises error

Could hparams['epochs'], which is passed to the epochs argument when the scheduler is created, be smaller than the epochs value used in the training loop?
In my code snippet the error after the training loop is expected, since I used epochs=1.

Same problem here. The first several epochs are OK, but then at some point this bug shows up, also with 2 extra steps:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-72b05a339c71> in <module>
      2 train_loop(
      3     manager, settings, model, device,
----> 4     train_loader, optimizer, scheduler, loss_func)

<ipython-input-22-8964fda4c572> in train_loop(manager, args, model, device, train_loader, optimizer, scheduler, loss_func)
     15                 loss.backward()
     16                 optimizer.step()
---> 17                 scheduler.step()
     18 
     19 def eval_for_batch(

/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py in step(self, epoch)
    139             if epoch is None:
    140                 self.last_epoch += 1
--> 141                 values = self.get_lr()
    142             else:
    143                 warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)

/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py in get_lr(self)
   1211         if step_num > self.total_steps:
   1212             raise ValueError("Tried to step {} times. The specified number of total steps is {}"
-> 1213                              .format(step_num + 1, self.total_steps))
   1214 
   1215         for group in self.optimizer.param_groups:

ValueError: Tried to step 2102 times. The specified number of total steps is 2100

Each epoch contains 42 iterations, without exception. My settings are below:

loader:
  train:
    batch_size: 50
    shuffle: True
    num_workers: 2
    pin_memory: True
    drop_last: True
  val:
    batch_size: 100
    shuffle: False
    num_workers: 2
    pin_memory: True
    drop_last: False

scheduler:
  name: OneCycleLR
  params:
    max_lr: 0.01
    steps_per_epoch: 42
    epochs: 50

And below is my training loop:

def train_loop(
    manager, args, model, device,
    train_loader, optimizer, scheduler, loss_func
):
    """Run minibatch training loop"""
    while not manager.stop_trigger:
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            with manager.run_iteration():
                data, target = data.to(device), target.to(device)
                optimizer.zero_grad()
                output = model(data)
                loss = loss_func(output, target)
                ppe.reporting.report({'train/loss': loss.item()})
                loss.backward()
                optimizer.step()
                scheduler.step()

At some point during the 50 epochs, a ValueError like this occurs and interrupts training. I don’t know how to solve it: if the configured number of steps were too small, I would expect the failure at 2150, 2200 or 2250 etc., so why 2102? I really can’t understand it…

The error message claims that scheduler.step() was called more than the specified 42*50=2100 times.
I don’t see the epoch loop in your code snippet, but I assume it’s somehow integrated into the manager?
For debugging purposes you could add a counter and print its accumulated value for each scheduler.step() call.
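The counter suggested above could look roughly like this. It is only a sketch: StepCounter is a hypothetical helper (not part of PyTorch), and the toy values for steps_per_epoch and epochs are assumptions chosen to keep the run short.

```python
import torch
from torch import optim


class StepCounter:
    """Hypothetical helper: forwards step() to a scheduler and counts the calls."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.calls = 0

    def step(self):
        self.calls += 1
        self.scheduler.step()


steps_per_epoch, epochs = 4, 3  # toy values, not taken from the thread
param = torch.zeros(1, requires_grad=True)
optimizer = optim.SGD([param], lr=0.1)
scheduler = StepCounter(
    optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.1, steps_per_epoch=steps_per_epoch, epochs=epochs
    )
)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()  # no gradients were computed, so parameters stay unchanged
        scheduler.step()
    # Compare the running count with the budget the scheduler was built with.
    print(f"epoch {epoch}: {scheduler.calls} of {scheduler.scheduler.total_steps} steps used")
```

If the count printed at the end of an epoch grows by more than steps_per_epoch, something in the loop is stepping the scheduler more often than it was configured for.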

Hi, I was able to solve my problem by changing steps_per_epoch in OneCycleLR to 2*(len(train_loader)).

Setting steps_per_epoch=len(train_loader) tells OneCycleLR to expect a single pass over the training loader per epoch. Since I was training twice per epoch, the scheduler ran out of steps halfway through.

I am not sure whether you are also training multiple times per epoch, for adversarial training or something else, so check that. Also check how many times scheduler.step() is executed per epoch. If it is executed more than len(train_loader) times per epoch, you need to change your steps_per_epoch accordingly.
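As a rough illustration of that fix, the sketch below sizes steps_per_epoch by the number of training passes per epoch. The toy dataset, the batch size and the passes_per_epoch name are assumptions made for the example, and the inner loop stands in for two real training functions.

```python
import torch
from torch import optim
from torch.utils.data import DataLoader

train_loader = DataLoader(torch.randn(10, 1), batch_size=2)  # 5 batches per pass
epochs = 3
passes_per_epoch = 2  # e.g. one standard pass plus one adversarial pass

optimizer = optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    # Budget one scheduler step for every batch of every pass.
    steps_per_epoch=passes_per_epoch * len(train_loader),
    epochs=epochs,
)

calls = 0
for epoch in range(epochs):
    for _ in range(passes_per_epoch):
        for batch in train_loader:
            optimizer.step()
            scheduler.step()
            calls += 1

print(calls, scheduler.total_steps)
```

With the budget matched to the actual number of scheduler.step() calls, the loop finishes without the ValueError.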

When you specify steps_per_epoch=len(train_dataloader), the scheduler computes the total number of steps for the whole training run from it and from epochs: total_steps = steps_per_epoch*epochs
If you then train for some more epochs without reinitializing the scheduler, it will show this error.
SOLUTION:
Re-create the scheduler with the new number of epochs:
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=391, epochs=epochs) # steps_per_epoch = len(train_loader)
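A small sketch of that situation, with toy values: the first run uses up the scheduler's whole budget, reusing the same scheduler for another epoch fails, and a freshly created scheduler works. The make_scheduler and run helpers and the STEPS_PER_EPOCH constant are hypothetical names introduced only for this example.

```python
import torch
from torch import optim

STEPS_PER_EPOCH = 4  # stands in for len(train_loader)


def make_scheduler(optimizer, epochs):
    # total_steps is derived as STEPS_PER_EPOCH * epochs
    return optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.1, steps_per_epoch=STEPS_PER_EPOCH, epochs=epochs
    )


def run(optimizer, scheduler, epochs):
    for _ in range(epochs * STEPS_PER_EPOCH):
        optimizer.step()
        scheduler.step()


optimizer = optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
scheduler = make_scheduler(optimizer, epochs=2)
run(optimizer, scheduler, epochs=2)  # consumes the full budget of 8 steps

try:
    run(optimizer, scheduler, epochs=1)  # same scheduler, budget already spent
    reused_ok = True
except ValueError as err:
    reused_ok = False
    print("reusing the old scheduler failed:", err)

scheduler = make_scheduler(optimizer, epochs=1)  # fresh scheduler, fresh budget
run(optimizer, scheduler, epochs=1)
```

Note that a freshly created OneCycleLR starts a new cycle from the beginning, so the learning rate warms up again rather than continuing from where the previous run ended.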

In case anyone hits the same error after loading state dicts to retrain or continue training a model: you need to update the total_steps value in the scheduler state dict before you load it:

scheduler = OneCycleLR(optimizer,...steps_per_epoch=int(len(train_loader)),epochs=args.epochs)
sch_dict = checkpoint['scheduler']
sch_dict['total_steps'] = sch_dict['total_steps'] + args.epochs * int(len(train_loader))
scheduler.load_state_dict(sch_dict)

I think it’s related to only OneCycleLR.

I ran into a similar error during one of my experiments. After some digging, I noticed that I had updated the number of epochs in the training for-loop but forgot to re-instantiate the scheduler with the new NUM_EPOCHS (which was defined earlier in another notebook cell).

Instantiating the scheduler with the updated number of epochs fixed the issue for me.

Here’s how I set up the scheduler and the training loop with NUM_EPOCHS:

NUM_EPOCHS = 5

# set the scheduler with OneCycleLR
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer=optim,max_lr=0.01,steps_per_epoch=len(train_loader), epochs=NUM_EPOCHS)

# training loop
for epoch in range(NUM_EPOCHS):
    model.train() # set model in training mode
    ....

THANK YOU @Etoye your answer saved my day !