My training code runs on 2 GPUs in a DDP setup; each GPU handles a batch of 128.
training_steps = Overall_data / (2 GPU*128) = 5453 steps
warmup_steps = 545
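For concreteness, the step arithmetic above can be sketched as follows (the dataset size here is a hypothetical value chosen only to be consistent with the quoted 5453 steps):

```python
world_size = 2
per_gpu_batch = 128
# Hypothetical dataset size, back-computed from the quoted 5453 steps.
overall_data = 5453 * world_size * per_gpu_batch  # 1,395,968 samples

# One optimizer step consumes world_size * per_gpu_batch samples in DDP.
training_steps = overall_data // (world_size * per_gpu_batch)
warmup_steps = int(0.1 * training_steps)

print(training_steps)  # 5453
print(warmup_steps)    # 545
```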
def lr_lambda(current_step: int):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    return max(
        0.0,
        float(num_training_steps - current_step)
        / float(max(1, num_training_steps - num_warmup_steps)),
    )
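A self-contained sketch of this linear warmup/decay lambda, plugging in the numbers from the calculation above (warmup_steps = 545, training_steps = 5453), shows the expected shape of the schedule:

```python
num_warmup_steps = 545
num_training_steps = 5453

def lr_lambda(current_step: int) -> float:
    # Linear warmup from 0 to 1 over num_warmup_steps...
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    # ...then linear decay back to 0 at num_training_steps.
    return max(
        0.0,
        float(num_training_steps - current_step)
        / float(max(1, num_training_steps - num_warmup_steps)),
    )

print(lr_lambda(0))     # 0.0 at the start of warmup
print(lr_lambda(545))   # 1.0 at the end of warmup
print(lr_lambda(5453))  # 0.0 at the end of training
```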
As per the above calculation, the learning rate I should have gotten at step x shows up at step x/2 when I look at the plot in TensorBoard (logged via scheduler.get_last_lr()).
Any reason for this behavior?
This seems unexpected; it could be an issue with how the data is split in your script, or with the accounting for the number of steps. Could you please attach a repro of your issue (maybe instead of TensorBoard, just print(scheduler.get_last_lr()) and compare it to what you expect)?
Also, is the result of scheduler.get_last_lr() what you expect if you turn off distributed training and train locally only?
Unfortunately, I will not be able to attach my code repo.
Data split :
sampler = torch.utils.data.distributed.DistributedSampler(ip_data)
dataloader = DataLoader(ip_data, batch_size=batch_size, sampler=sampler, persistent_workers=True, num_workers=16)
# ip_data is (X, y)
num_training_steps = int(epochs * (len(train_loader) / dist.get_world_size()))
scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=int(0.1 * (len(train_loader) / dist.get_world_size())), num_training_steps=num_training_steps)
# get_scheduler is from Hugging Face
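One thing worth double-checking in the step accounting: DistributedSampler already shards the dataset across ranks, so len(train_loader) is already a per-rank count. A quick sketch of the documented sampler length semantics (with drop_last=False, each rank gets ceil(N / world_size) samples; the dataset size below is hypothetical, chosen to match the 5453 steps quoted earlier):

```python
import math

world_size = 2
batch_size = 128
dataset_len = 5453 * world_size * batch_size  # hypothetical size

# DistributedSampler (drop_last=False) gives each rank ceil(N / world_size) samples,
# so the DataLoader length is already a per-rank step count.
per_rank_samples = math.ceil(dataset_len / world_size)
steps_per_epoch = math.ceil(per_rank_samples / batch_size)

print(steps_per_epoch)                # 5453: len(train_loader) is already per-rank
print(steps_per_epoch // world_size)  # 2726: dividing by world_size again halves it
```

If the schedule length is halved this way, the scheduler would reach any given learning rate in half the expected number of steps.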
scheduler.get_last_lr() → this is how I stored the values in TensorBoard:
[ScalarEvent(wall_time=1656084727.27605, step=100, value=3.676470441860147e-05),
 ScalarEvent(wall_time=1656084836.8856416, step=200, value=7.352940883720294e-05),
 ScalarEvent(wall_time=1656084946.7749376, step=300, value=9.989627142203972e-05),
 …]
If you call the function lr_lambda which I provided above:
lr_lambda(200) * 0.0001 → 3.669724770642202e-05, which is equivalent to the value logged at step 100.
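The comparison above can be checked numerically (assuming num_warmup_steps = 545 and a base learning rate of 1e-4, as in the calculation earlier in the thread):

```python
num_warmup_steps = 545
base_lr = 0.0001

def warmup_factor(step: int) -> float:
    # Warmup portion of the linear schedule (step < num_warmup_steps).
    return float(step) / float(max(1, num_warmup_steps))

expected_at_200 = warmup_factor(200) * base_lr
observed_at_100 = 3.676470441860147e-05  # TensorBoard value logged at step 100

print(expected_at_200)  # ~3.67e-05, close to the value logged at step 100
```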
I will try to run it with a single GPU.
Thanks for the response