Gradient clipping and gradient accumulation together

I am trying to do gradient accumulation (based on this) for an RNN model, and I also need to clip gradients. Could you please check whether I am combining these two operations correctly in the code below (right order and logic)?

import torch
from torch import nn
from torch.amp import GradScaler, autocast
from tqdm import tqdm

# model, optimizer, device and train_generator are defined earlier
loss_function = nn.BCEWithLogitsLoss()
num_batches = 1
running_loss = 0.0
iters_to_accumulate = 4
scaler = GradScaler()

model.train()
for batch in tqdm(train_generator, desc='Training'):
    with autocast(device_type=device.type, dtype=torch.float16):
        output = torch.flatten(model(batch['features'], batch['category'])).to(device)
        batch_loss = loss_function(output, batch['label'].float())
        # divide the loss so the accumulated gradients average over the group of batches
        batch_loss = batch_loss / iters_to_accumulate

    # backward on the scaled loss; gradients accumulate across iterations
    scaler.scale(batch_loss).backward()

    if num_batches % iters_to_accumulate == 0:
        # unscale before clipping so clip_grad_norm_ sees the true gradient magnitudes
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

    num_batches += 1

When you do gradient accumulation, are you supposed to average the gradients before running the optimizer? If the point of gradient accumulation is to amortize the cost of the optimizer step over a greater number of batches, then I'd guess no (see the sketch at the end of this post).

In that case, you should increase your max_norm in proportion to how many iterations you accumulate. Another question is how you decided on your max_norm in the first place.

That’s just my intuition, but I’m curious what the literature says here.
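
To make the averaging question concrete: dividing each batch loss by iters_to_accumulate, as the posted code does, makes the accumulated gradient equal to the gradient of the averaged loss. A toy check of that equivalence (all names below are illustrative, not from the code above):

import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
micro_batches = [torch.randn(3) for _ in range(4)]
N = len(micro_batches)

# accumulate gradients of loss / N over N micro-batches
for x in micro_batches:
    loss = (w * x).sum() / N   # scale each loss by 1/N
    loss.backward()            # gradients are summed into w.grad

accumulated = w.grad.clone()

# compare with the gradient of the mean loss over one "big" batch
w.grad = None
big_loss = torch.stack([(w * x).sum() for x in micro_batches]).mean()
big_loss.backward()

print(torch.allclose(accumulated, w.grad))  # True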

Thank you. I didn’t find any evidence-based articles on how to choose the max_norm value, so I chose it empirically, based on how the gradient norm of my model changes during training (see the sketch below).

I would also like to hear the moderators’ opinion :)
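
In case it is useful: torch.nn.utils.clip_grad_norm_ returns the total gradient norm it measured before clipping, so one way to choose max_norm empirically is to log that value for a while and pick a threshold from the distribution you observe. A minimal sketch of that idea (the helper names are made up for illustration):

import torch
from torch import nn

def clip_and_record(model: nn.Module, history: list, max_norm: float = 1.0) -> float:
    # clip gradients and record the pre-clipping total norm;
    # call this in place of the bare clip_grad_norm_ line, after scaler.unscale_(optimizer)
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    history.append(total_norm.item())
    return history[-1]

def summarize(history: list) -> None:
    # after training for a while, look at the distribution to decide on max_norm
    norms = torch.tensor(history)
    print(f"median={norms.median().item():.3f}  "
          f"p95={norms.quantile(0.95).item():.3f}  "
          f"max={norms.max().item():.3f}")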

Your code looks fine as it sticks to this example.