Gradient clipping and gradient accumulation together

I am trying to do gradient accumulation (based on this) for an RNN model, and I also need to clip gradients. Could you please check whether I am combining these two operations correctly in the code below (right order and logic)?

import torch
from torch import nn
from torch.amp import GradScaler, autocast
from tqdm import tqdm

# model, optimizer, device and train_generator are defined earlier
loss_function = nn.BCEWithLogitsLoss()
num_batches = 1
running_loss = 0.0
iters_to_accumulate = 4
scaler = GradScaler()

model.train()
for batch in tqdm(train_generator, desc='Training'):
    with autocast(device_type=device.type, dtype=torch.float16):
        output = torch.flatten(model(batch['features'], batch['category'])).to(device)
        batch_loss = loss_function(output, batch['label'].float())
        # divide the loss so the accumulated gradients average over the group of batches
        batch_loss = batch_loss / iters_to_accumulate

    # backward on the scaled loss; gradients accumulate across iterations
    scaler.scale(batch_loss).backward()

    if num_batches % iters_to_accumulate == 0:
        # unscale before clipping so clip_grad_norm_ sees the true gradient magnitudes
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

    num_batches += 1

When you do gradient accumulation, are you supposed to average the gradients before running the optimizer? If the point of gradient accumulation is to amortize the cost of the optimizer step over a greater number of batches, then I'd guess no (see the sketch at the end of this post).

In that case, you should increase your max_norm in proportion to how many iterations you accumulate. Another question is how you decided on your max_norm in the first place.

That’s just my intuition, but I’m curious what the literature says here.
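
To make the averaging question concrete: dividing each batch loss by iters_to_accumulate, as the posted code does, makes the accumulated gradient equal to the gradient of the averaged loss. A toy check of that equivalence (all names below are illustrative, not from the code above):

import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
micro_batches = [torch.randn(3) for _ in range(4)]
N = len(micro_batches)

# accumulate gradients of loss / N over N micro-batches
for x in micro_batches:
    loss = (w * x).sum() / N   # scale each loss by 1/N
    loss.backward()            # gradients are summed into w.grad

accumulated = w.grad.clone()

# compare with the gradient of the mean loss over one "big" batch
w.grad = None
big_loss = torch.stack([(w * x).sum() for x in micro_batches]).mean()
big_loss.backward()

print(torch.allclose(accumulated, w.grad))  # True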

Thank you. I didn’t find any evidence-based articles on how to choose the max_norm value, so I chose it empirically, based on how the gradient norm of my model changes during training (see the sketch below).

I would also like to hear the moderators’ opinion :)
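
In case it is useful: torch.nn.utils.clip_grad_norm_ returns the total gradient norm it measured before clipping, so one way to choose max_norm empirically is to log that value for a while and pick a threshold from the distribution you observe. A minimal sketch of that idea (the helper names are made up for illustration):

import torch
from torch import nn

def clip_and_record(model: nn.Module, history: list, max_norm: float = 1.0) -> float:
    # clip gradients and record the pre-clipping total norm;
    # call this in place of the bare clip_grad_norm_ line, after scaler.unscale_(optimizer)
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    history.append(total_norm.item())
    return history[-1]

def summarize(history: list) -> None:
    # after training for a while, look at the distribution to decide on max_norm
    norms = torch.tensor(history)
    print(f"median={norms.median().item():.3f}  "
          f"p95={norms.quantile(0.95).item():.3f}  "
          f"max={norms.max().item():.3f}")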

Your code looks fine as it sticks to this example.