My training code runs on 2 GPUs in a DDP setup; each GPU handles a batch of 128.
training_steps = Overall_data / (2 GPU*128) = 5453 steps
warmup_steps = 545
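For concreteness, the step arithmetic above can be sketched as follows (the dataset size here is a hypothetical value chosen only to be consistent with the quoted 5453 steps):

```python
world_size = 2
per_gpu_batch = 128
# Hypothetical dataset size, back-computed from the quoted 5453 steps.
overall_data = 5453 * world_size * per_gpu_batch  # 1,395,968 samples

# One optimizer step consumes world_size * per_gpu_batch samples in DDP.
training_steps = overall_data // (world_size * per_gpu_batch)
warmup_steps = int(0.1 * training_steps)

print(training_steps)  # 5453
print(warmup_steps)    # 545
```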
def lr_lambda(current_step: int):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    return max(
        0.0,
        float(num_training_steps - current_step)
        / float(max(1, num_training_steps - num_warmup_steps)),
    )
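A self-contained sketch of this linear warmup/decay lambda, plugging in the numbers from the calculation above (warmup_steps = 545, training_steps = 5453), shows the expected shape of the schedule:

```python
num_warmup_steps = 545
num_training_steps = 5453

def lr_lambda(current_step: int) -> float:
    # Linear warmup from 0 to 1 over num_warmup_steps...
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    # ...then linear decay back to 0 at num_training_steps.
    return max(
        0.0,
        float(num_training_steps - current_step)
        / float(max(1, num_training_steps - num_warmup_steps)),
    )

print(lr_lambda(0))     # 0.0 at the start of warmup
print(lr_lambda(545))   # 1.0 at the end of warmup
print(lr_lambda(5453))  # 0.0 at the end of training
```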
As per the above calculation, the learning rate I should have gotten at step x shows up at step x/2 when I look at the plot in TensorBoard (logged via scheduler.get_last_lr()).
Any reason for this behavior?
This seems unexpected; it could be an issue with how the data is split in your script, or with the accounting for the number of steps. Could you please attach a repro of your issue (maybe instead of TensorBoard, just print(scheduler.get_last_lr()) and compare it to what you expect)?
Also, is the result of scheduler.get_last_lr() what you expect if you turn off distributed training and train locally only?
Unfortunately, I will not be able to attach my code repo.
Data split :
sampler = torch.utils.data.distributed.DistributedSampler(ip_data)
dataloader = DataLoader(ip_data, batch_size=batch_size, sampler=sampler, persistent_workers=True, num_workers=16)
# ip_data is (X, y)
num_training_steps = int(epochs * (len(train_loader) / dist.get_world_size()))
scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=int(0.1 * (len(train_loader) / dist.get_world_size())), num_training_steps=num_training_steps)
# get_scheduler is from Hugging Face
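One thing worth double-checking in the step accounting: DistributedSampler already shards the dataset across ranks, so len(train_loader) is already a per-rank count. A quick sketch of the documented sampler length semantics (with drop_last=False, each rank gets ceil(N / world_size) samples; the dataset size below is hypothetical, chosen to match the 5453 steps quoted earlier):

```python
import math

world_size = 2
batch_size = 128
dataset_len = 5453 * world_size * batch_size  # hypothetical size

# DistributedSampler (drop_last=False) gives each rank ceil(N / world_size) samples,
# so the DataLoader length is already a per-rank step count.
per_rank_samples = math.ceil(dataset_len / world_size)
steps_per_epoch = math.ceil(per_rank_samples / batch_size)

print(steps_per_epoch)                # 5453: len(train_loader) is already per-rank
print(steps_per_epoch // world_size)  # 2726: dividing by world_size again halves it
```

If the schedule length is halved this way, the scheduler would reach any given learning rate in half the expected number of steps.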
scheduler.get_last_lr() → this is how I stored the values in TensorBoard:
[ScalarEvent(wall_time=1656084727.27605, step=100, value=3.676470441860147e-05),
 ScalarEvent(wall_time=1656084836.8856416, step=200, value=7.352940883720294e-05),
 ScalarEvent(wall_time=1656084946.7749376, step=300, value=9.989627142203972e-05),
 …]
If you call the function lr_lambda which I provided above:
lr_lambda(200) * 0.0001 → 3.669724770642202e-05, which is equivalent to the value logged at step 100.
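The comparison above can be checked numerically (assuming num_warmup_steps = 545 and a base learning rate of 1e-4, as in the calculation earlier in the thread):

```python
num_warmup_steps = 545
base_lr = 0.0001

def warmup_factor(step: int) -> float:
    # Warmup portion of the linear schedule (step < num_warmup_steps).
    return float(step) / float(max(1, num_warmup_steps))

expected_at_200 = warmup_factor(200) * base_lr
observed_at_100 = 3.676470441860147e-05  # TensorBoard value logged at step 100

print(expected_at_200)  # ~3.67e-05, close to the value logged at step 100
```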
I will try to run it with a single GPU.
Thanks for the response