Accumulating Gradients

I want to accumulate the gradients before I do a backward pass, so I'm wondering what the right way of doing it is. According to this article it's (let's assume equal batch sizes):

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()

whereas I expected it to be:

model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Accumulate the loss tensor
    if (i+1) % accumulation_steps == 0:             # Wait for several forward passes
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Single backward pass over the accumulated graph
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()
        loss = 0

where I accumulate the loss and then divide by the accumulation steps to average it.

Secondary question: if I am right, would you expect my method to be quicker, considering I only do the backward pass once every accumulation_steps batches?

This is a crosspost from SO.


Hi,

This has been discussed in detail in this post.
Let me know if you have further questions!


Thanks for this. Could you see if the logic in my understanding below is correct?

Suppose that in my case accumulation_steps = 10. With my second code block, I would then be creating the graph ten times, and would therefore require ten times the memory. However, what I'm not 100% sure about is this: is it ten times the memory to hold the parameters, or ten times the parameters plus the intermediate values calculated in each batch?

The second question is: isn't loss.backward() the most compute-intensive step, since this is where the gradients are calculated? So, provided that I can hold 10 graphs in memory, wouldn't this option be the fastest?

Hi,

It's never 10 times the parameters. It's only 10 times the intermediate states.

The amount of work in both cases is the same. The difference is that you call a single backward, but the graph it backpropagates through is 10 times bigger. So compute-wise it will be the same.
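
For intuition, here is a minimal sketch (not from this thread) that compares the peak memory of the two patterns on a toy model; the model, tensor sizes and CUDA usage are illustrative assumptions. Accumulating the loss keeps every batch's activation graph alive until the single backward(), while calling backward() per batch frees each graph right away:

import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_function = nn.MSELoss()
accumulation_steps = 10
batches = [(torch.randn(64, 1024, device=device),
            torch.randn(64, 1, device=device)) for _ in range(accumulation_steps)]

def peak_memory(accumulate_loss):
    torch.cuda.reset_peak_memory_stats()
    optimizer.zero_grad()
    total = 0
    for inputs, labels in batches:
        loss = loss_function(model(inputs), labels) / accumulation_steps
        if accumulate_loss:
            total = total + loss                    # keeps this batch's graph alive
        else:
            loss.backward()                         # frees this batch's graph now
    if accumulate_loss:
        total.backward()                            # one big backward over all 10 graphs
    optimizer.step()
    return torch.cuda.max_memory_allocated()

print("backward per batch: ", peak_memory(False))
print("accumulate the loss:", peak_memory(True))

The resulting gradients (and therefore the parameter update) are the same either way; only the peak activation memory differs.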


Hi,
If I use the T5 language model, there isn't any batch-norm layer inside; should I use the averaged loss then?
Or can I just call loss.backward() on every iteration and opt.step() / opt.zero_grad() when iteration % accumulation_steps == 0?
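
For reference, a minimal sketch of the pattern described in that question (assuming model, optimizer, loss_function, training_set and accumulation_steps are defined as in the snippets above). Without the division by accumulation_steps, the accumulated gradient is the sum of the per-batch gradients, i.e. accumulation_steps times larger than with the averaged loss, which effectively scales the learning rate:

optimizer.zero_grad()                               # Reset gradient tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Per-batch loss (not averaged)
    loss.backward()                                 # Accumulate (sum) gradients
    if (i+1) % accumulation_steps == 0:
        optimizer.step()                            # Update with the summed gradients
        optimizer.zero_grad()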