I just want to confirm my understanding of the T_0 argument.
Let’s say I have 500 epochs and a data loader of length 97:
```python
loader_data_size = 97

for epoch in epochs:
    # In my case this happens in a different place, so I track the epoch in state.
    self.state.epoch = epoch
    for batch_idx, batch in enumerate(self._train_loader):
        # Same calculation as in the example.
        next_step = self.state.epoch + batch_idx / loader_data_size
        scheduler.step(next_step)
```
If I understand the semantics, it will anneal on every batch, and since I set T_0 = 97, it will restart at the end of each epoch. If that is the case, it implies the last batch of every epoch (batch_idx = 96) always gets the lowest LR. If my understanding is correct, do I need to shift the cosine function somehow and slide it to fix this behavior, or will it slide by itself? I.e., for epoch 0 the LR is lowest at batch_idx = 96; at epoch 1, will the LR at batch_idx = 96 be the same as in epoch 0, or will it slide?
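To make the question concrete, here is a quick sanity check I could run without the scheduler, using the cosine formula from the `CosineAnnealingWarmRestarts` docs. The `eta_max = 0.1` and `eta_min = 0.0` values are just placeholders, and this assumes `T_mult = 1` so every cycle has length `T_0`:

```python
import math

def cosine_lr(t_cur, t_0, eta_max=0.1, eta_min=0.0):
    """LR from the CosineAnnealingWarmRestarts formula in the PyTorch docs:
    eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i)),
    with a restart (T_cur wrapping to 0) every t_0 fractional "epochs".
    """
    t_cur = t_cur % t_0  # restart every t_0 units of the value passed to step()
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_0))

loader_data_size = 97
t_0 = 97

# LR seen by the last batch (batch_idx = 96) of epoch 0 vs epoch 1,
# when stepping with epoch + batch_idx / loader_data_size as above.
for epoch in (0, 1):
    step = epoch + 96 / loader_data_size
    print(f"epoch {epoch}, batch 96 -> lr {cosine_lr(step, t_0):.6f}")
```

With this counting, one epoch only advances the argument by 1.0, so with `t_0 = 97` the two printed values differ slightly rather than restarting each epoch, which is exactly the "does it slide?" question above.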