Gradient Accumulation

Hello,
Because of memory constraints, I can only use a batch_size of 1, but then I came across a trick called “Gradient Accumulation”. I have implemented two versions of it and would like to know which one is correct and why, if possible.

Version 1:

            # Backpropagation step
            total_loss.backward()

            # Accumulate gradients and update at every 32nd iteration.
            if (iter_i+1) % 32 == 0:
                optimizer.step()
                optimizer.zero_grad()

Version 2:

            # Accumulate gradients and update at every 32nd iteration.
            if (iter_i+1) % 32 == 0:
                total_loss.backward()
                optimizer.step()
                optimizer.zero_grad()

The total_loss consists of the following loss functions:

cls_loss_function = HeatmapLoss(reduction='mean')  # Custom Loss
txty_loss_function = nn.BCEWithLogitsLoss(reduction='none')
twth_loss_function = nn.SmoothL1Loss(reduction='none')

Should the total_loss be divided by 32 before using the backward function? I have several BatchNorm2d layers in the architecture.

So the question here is whether the total_loss.backward() should be inside or outside the if statement.

This post gives some examples with detailed explanations of the advantages and shortcomings of each approach. 🙂

@ptrblck Thank you for replying. I have checked that post, but there seem to be some discrepancies between it and the steps in the code below, which I will post here:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated

Above, the loss is divided by the accumulation steps before calling the backward function, and the comment says “Normalize our loss (if averaged)”, but I don’t really know what that means.

Also, should I still use gradient accumulation if I have BatchNorm2D Layers in my network?

Thank You.

Often you would divide the loss by accumulation_steps, so this point does indeed seem to be missing in the linked post. If each loss is already averaged over its (micro-)batch, summing the gradients of accumulation_steps such losses makes the accumulated gradient accumulation_steps times larger than the gradient of one big batch, and dividing each loss restores the average over the effective batch.
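
As a quick toy illustration (a made-up linear model, nothing to do with your architecture): accumulating mean-reduced losses from micro-batches of size 1 without the division gives a gradient that is accumulation_steps times larger than the gradient of one big batch, while dividing each loss recovers it:

import torch

# Toy example: compare the gradient of one mean-reduced batch of 4 samples
# against 4 accumulated micro-batches of size 1, with and without dividing
# each micro-loss by the number of accumulation steps.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(4, 3)
y = torch.randn(4)

# Reference: mean-reduced loss over the full batch of 4 samples.
((x @ w - y) ** 2).mean().backward()
grad_full = w.grad.clone()

# Accumulation without normalization: roughly 4x the reference gradient.
w.grad = None
for i in range(4):
    ((x[i:i+1] @ w - y[i:i+1]) ** 2).mean().backward()
print(torch.allclose(w.grad, 4 * grad_full))  # should print True

# Accumulation with loss / 4: matches the reference gradient.
w.grad = None
for i in range(4):
    (((x[i:i+1] @ w - y[i:i+1]) ** 2).mean() / 4).backward()
print(torch.allclose(w.grad, grad_full))      # should print True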

You could try to use it, but note that the smaller the actual batch size is, the noisier the running stats in the batchnorm layers will get. This is not specific to gradient accumulation, but a general limitation of small batch sizes in combination with batchnorm layers.
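
A quick way to see this (toy layer and input, just for illustration): the running stats are updated on every forward pass, i.e. per micro-batch, independently of when optimizer.step() is called, so gradient accumulation doesn’t give the batchnorm layers a larger effective batch:

import torch
import torch.nn as nn

# In train mode, batchnorm updates its running stats during the forward pass,
# no backward() or optimizer.step() needed.
bn = nn.BatchNorm2d(3)
x = torch.randn(1, 3, 8, 8)

print(bn.running_mean)  # all zeros before any forward pass
bn(x)                   # forward only
print(bn.running_mean)  # already updated from this single size-1 batch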

@ptrblck I have another system on which I can use a batch size of at most 8. But I read on a blog that BatchNorm works best with a batch size of 32. Would you recommend using InstanceNorm2d or GroupNorm instead of BatchNorm2d?

Edit:
After reading the GroupNorm paper, it is clear that GroupNorm combined with Weight Standardization should theoretically be able to outperform BatchNorm for smaller batch sizes. What do you think?
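
For reference, my rough (untested) understanding of Weight Standardization is that each conv filter is standardized over its (in_channels, kH, kW) dimensions before the convolution, and the conv is then paired with GroupNorm; a minimal sketch of the idea (the authors’ repository has the actual implementation):

import torch.nn as nn
import torch.nn.functional as F

# Sketch of Weight Standardization: standardize each filter of the conv weight
# before applying the convolution.
class WSConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        w = w - w.mean(dim=(1, 2, 3), keepdim=True)
        w = w / (w.std(dim=(1, 2, 3), keepdim=True) + 1e-5)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Typical pairing with GroupNorm instead of BatchNorm.
block = nn.Sequential(
    WSConv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=128),
    nn.ReLU(inplace=True),
)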

I also wanted to suggest taking a look at the GroupNorm paper, which you’ve already done.
However, I don’t have any stats to share on when one approach works better than the other, so in case you are comparing both, it would be great to hear back which experiment outperformed the other one.

@ptrblck Hi,

I will definitely post the results here. However, my next hurdle is using a pre-trained model: can I use the pretrained models provided by PyTorch and just change the BatchNorm layers? Or do the pretrained weights become redundant with GroupNorm (GN) + Weight Standardization (WS)? If so, do you know where I can find pre-trained ResNet models with GN+WS?

The code for GN+WS can be found here.

I really don’t have the time or system capabilities to train the network from scratch.

I’m not sure, but I wouldn’t expect that replacing the batchnorm layers with GroupNorm ones would work out of the box (I might be wrong, as I haven’t looked into it deeply). Also, I’m not aware of any (official) pretrained ImageNet models using GroupNorm, so you might indeed need to train them from scratch.
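
If it helps, one possible way to build such a model for training from scratch could be torchvision’s norm_layer argument; this is just an untested sketch, and the BatchNorm weights from the pretrained checkpoints would not map onto the GroupNorm layers:

import torch.nn as nn
import torchvision

# Untested sketch: the torchvision ResNet builders accept a norm_layer callable
# that receives the number of channels for each normalization layer.
def group_norm(num_channels):
    return nn.GroupNorm(num_groups=32, num_channels=num_channels)

model = torchvision.models.resnet50(norm_layer=group_norm)

The standard ResNet channel counts (64, 128, 256, 512, 1024, 2048) are all divisible by 32, so num_groups=32 should be a valid choice here.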

@ptrblck
Ahh no, that’s what I was afraid of. I will try training them from scratch. Are there any set hyperparameters (learning rate, number of epochs, optimizer (SGD, Adam)) one should use for training the ResNet model from scratch? The problem is that I only have 2750 training images.