How to handle learning rate scheduler in DDP

My training code runs on 2 GPUs in a DDP set-up; each GPU handles a batch of 128.

training_steps = Overall_data / (2 GPUs * 128) = 5453 steps
warmup_steps = 545
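
Spelled out (Overall_data itself is back-computed here from the 5453 figure, just to make the arithmetic explicit):

    world_size, per_gpu_batch = 2, 128
    overall_data = 5453 * world_size * per_gpu_batch                 # ~1.4M samples, implied by the numbers above

    training_steps = overall_data // (world_size * per_gpu_batch)    # 5453
    warmup_steps = int(0.1 * training_steps)                         # 545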

def lr_lambda(current_step: int):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    return max(
        0.0,
        float(num_training_steps - current_step)
        / float(max(1, num_training_steps - num_warmup_steps)),
    )

As per the above calculation, the learning rate I should be getting at step x is what I actually see at step x/2 when I look at the plot in TensorBoard (logged using scheduler.get_last_lr()[0]).

Any reason for this behavior?

This seems unexpected; it could possibly be an issue with how the data is split in your script or with the accounting for the number of steps. Could you please attach a repro of your issue? (Maybe instead of TensorBoard, just print(scheduler.get_last_lr()[0]) and compare it to what you expect.)

Also, is the result of scheduler.get_last_lr()[0] what you expect if you turn off distributed training and train locally only?
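
For example, a minimal data-free sketch along these lines (placeholder model; an arbitrary base lr of 1e-4 for illustration) already shows what the schedule should produce at each step:

    import torch

    num_warmup_steps, num_training_steps = 545, 5453

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(
            0.0,
            float(num_training_steps - current_step)
            / float(max(1, num_training_steps - num_warmup_steps)),
        )

    model = torch.nn.Linear(10, 1)                              # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # arbitrary base lr for illustration
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for step in range(1, 301):
        optimizer.step()                  # forward/backward omitted in this sketch
        scheduler.step()
        if step % 100 == 0:
            print(step, scheduler.get_last_lr()[0])
    # Expected with warmup=545: step 100 ~ 1.83e-05, step 200 ~ 3.67e-05, step 300 ~ 5.50e-05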

Unfortunately, I will not be able to attach my code repo.

  1. Data split:
    sampler = torch.utils.data.distributed.DistributedSampler(ip_data)
    dataloader = DataLoader(ip_data, batch_size=batch_size, sampler=sampler,
                            persistent_workers=True, num_workers=16)
    # ip_data is (X, y)

  2. Scheduler (see the sketch after this list):
    num_training_steps = int(epochs * (len(train_loader) / dist.get_world_size()))
    scheduler = get_scheduler("linear", optimizer=optimizer,
                              num_warmup_steps=int(0.1 * (len(train_loader) / dist.get_world_size())),
                              num_training_steps=num_training_steps)
    # get_scheduler is from Hugging Face transformers

  3. scheduler.get_last_lr()[0] → this is how I stored the values shown in TensorBoard:
    [ScalarEvent(wall_time=1656084727.27605, step=100, value=3.676470441860147e-05),
     ScalarEvent(wall_time=1656084836.8856416, step=200, value=7.352940883720294e-05),
     ScalarEvent(wall_time=1656084946.7749376, step=300, value=9.989627142203972e-05), ...
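
For reference on the step accounting (the sketch referenced in point 2 above): DistributedSampler already restricts each rank to a 1/world_size slice of the dataset, so len(train_loader) built on that sampler is the per-rank number of batches. A standalone check, with a toy dataset standing in for ip_data and num_replicas/rank passed explicitly so it runs without init_process_group:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    # Toy stand-in for ip_data: 1024 samples, per-GPU batch of 128, world size of 2.
    toy_data = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    sampler = DistributedSampler(toy_data, num_replicas=2, rank=0)
    loader = DataLoader(toy_data, batch_size=128, sampler=sampler)

    print(len(toy_data))   # 1024 samples overall
    print(len(sampler))    # 512  -> this rank only sees 1/world_size of the data
    print(len(loader))     # 4    -> 1024 / (2 * 128): already divided by the world size

If len(train_loader) is already the per-rank length, then dividing by dist.get_world_size() again in point 2 would halve the warmup and total steps the scheduler sees, which would match the factor of 2 in the plot, so that is something I need to double-check.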

If you call the function lr_lambda which I provided above:
lr_lambda(200) * 0.0001 → 3.669724770642202e-05, which is roughly the value logged at step 100 above.
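
A quick way to compare both values (restating the lambda with the warmup length as an argument, and assuming the 0.0001 base lr used above):

    num_training_steps, base_lr = 5453, 1e-4

    def lr_lambda(current_step, num_warmup_steps):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(
            0.0,
            float(num_training_steps - current_step)
            / float(max(1, num_training_steps - num_warmup_steps)),
        )

    print(lr_lambda(200, 545) * base_lr)   # 3.6697e-05 -> what I expect at step 200
    print(lr_lambda(100, 272) * base_lr)   # 3.6765e-05 -> what TensorBoard shows at step 100 (272 ~ 545 / 2)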

I will try to run it with a single GPU.
Thanks for the response.