Hi,
I am wondering whether there are any tutorials or examples on the correct usage of a learning rate scheduler when training with DDP/FSDP. For example, if the LR scheduler is OneCycleLR, how should I define the total number of steps in the cycle, i.e., the total_steps argument (or the steps_per_epoch and epochs arguments) of the scheduler?
The reason I am asking is that this scheduler updates the LR based on the step count, but in DDP the number of steps each process runs differs from the effective number of steps (e.g., where non-DDP training takes 256 steps, DDP with 2 GPUs takes only 128 steps per process).
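To make the question concrete, here is a minimal sketch of the setup I have in mind (train_dataset and args.batch_size are just placeholders; I am assuming a standard DistributedSampler, so len(train_loader) already reflects the per-process shard):

import torch

# DistributedSampler splits the dataset across ranks, so each process sees
# roughly len(train_dataset) / world_size samples per epoch.
sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=sampler)

# With 2 GPUs this is ~128 instead of the 256 a single process would run.
# Is this the right value for steps_per_epoch, given that every rank calls
# scheduler.step() once per local iteration?
steps_per_epoch = len(train_loader)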
Also, where should I call scheduler.step() if I am also using AMP? Assume my code looks like this:
from torch.cuda.amp import autocast, GradScaler
from torch.optim.lr_scheduler import OneCycleLR

steps_per_epoch = ???  # <-- this is my question: what should this be under DDP?
n_epoch = 10
scheduler = OneCycleLR(optimizer, max_lr=0.01,
                       steps_per_epoch=steps_per_epoch, epochs=n_epoch)

for it, (img, labels) in enumerate(train_loader):
    # measure data loading time
    data_time.update(time.time() - end)
    bs = img.size(0)
    img = img.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # compute output and loss
    with autocast(enabled=args.amp, dtype=args.amp_dtype):
        outputs = model(img)
        loss = criteria(outputs, labels)
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    # scheduler.step()  <-- is this the correct position?
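In case it helps frame the question: my current guess (and I am not sure it is correct) is to call scheduler.step() after scaler.update(), and to skip it on iterations where the scaler skipped the optimizer step due to inf/nan gradients, since OneCycleLR counts every call as a step. A sketch of that guess, using only scaler.get_scale() to detect a skipped step:

optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)       # may skip optimizer.step() on inf/nan gradients
prev_scale = scaler.get_scale()
scaler.update()              # reduces the scale if the step was skipped
if scaler.get_scale() >= prev_scale:
    scheduler.step()         # only advance the schedule on real optimizer steps

Is this the right placement, or is an unconditional scheduler.step() after scaler.update() fine?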
Thank you!