I want to accumulate the gradients before I do a backward pass, so I am wondering what the right way of doing it is. According to this article, it's (let's assume equal batch sizes):

```python
model.zero_grad()                                   # Reset gradient tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradient tensors
```

whereas I expected it to be:

```python
model.zero_grad()                                   # Reset gradient tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Accumulate the loss
    if (i + 1) % accumulation_steps == 0:           # Wait for several steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradient tensors
        loss = 0
```

where I accumulate the loss and then divide by the accumulation steps to average it.

Secondary question: if I am right, would you expect my method to be quicker, considering I only do the backward pass once every `accumulation_steps` iterations?

This is a crosspost from SO.


Hi,

This has been discussed in detail in this post.
Let me know if you have further questions!

Thanks for this. Could you see if the logic in my understanding below is correct?

Suppose that in my case `accumulation_steps = 10`. With my second code block, I would be creating the graph ten times and would therefore require ten times the memory. However, what I'm not 100% sure about is: is this ten times the memory needed to hold the parameters, or ten times the parameters plus the intermediate values calculated in each batch?

The second question is: isn't `loss.backward()` the most compute-intensive step, since this is where the gradients are calculated? So provided that I can hold 10 graphs in memory, wouldn't this option be the fastest?

Hi,

It's never 10 times the parameters. It's only 10 times the intermediate states.

The amount of work in both cases is the same. The difference is that you call a single backward, but the graph it backwards through is 10 times bigger. So compute-wise it will be the same.
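This equivalence is easy to check directly: doing one backward per micro-batch (gradient accumulation) and summing the losses before a single backward produce the same gradients, up to floating-point noise. A minimal sketch, assuming a toy linear model and random data that are not part of the thread:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)          # hypothetical tiny model for illustration
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(3)]

# Variant 1: backward after every micro-batch; gradients accumulate in .grad.
model.zero_grad()
for x, y in data:
    (loss_fn(model(x), y) / len(data)).backward()
grads_a = [p.grad.clone() for p in model.parameters()]

# Variant 2: accumulate the losses (keeping all graphs alive) and backward once.
model.zero_grad()
total = sum(loss_fn(model(x), y) for x, y in data) / len(data)
total.backward()
grads_b = [p.grad.clone() for p in model.parameters()]

# Both variants yield the same gradients (up to floating-point noise).
same = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(grads_a, grads_b))
print(same)
```

The practical difference is memory, not compute: variant 2 must keep the intermediate activations of all the micro-batches alive until the single `backward()` call.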


Hi,
If I use the T5 language model, there isn't any batch-norm layer inside, so should I still use the averaged loss?
Or can I just call `loss.backward()` on every iteration, and `opt.step()` / `opt.zero_grad()` when `iteration % accumulation_steps == 0`?
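The pattern described in that last question can be sketched as follows. This is a minimal illustration with a placeholder linear model and random data, not T5; the model, optimizer settings, and data are assumptions for the sake of the example:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)          # placeholder for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
training_set = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]
accumulation_steps = 4

before = [p.detach().clone() for p in model.parameters()]

opt.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    # Normalize so the accumulated gradient matches the average over the
    # accumulated micro-batches.
    loss = loss_fn(model(inputs), labels) / accumulation_steps
    loss.backward()                     # gradients accumulate in .grad
    if (i + 1) % accumulation_steps == 0:
        opt.step()                      # one update per accumulated batch
        opt.zero_grad()                 # reset for the next accumulation
```

Calling `backward()` every iteration frees each micro-batch's graph immediately, so memory stays at one micro-batch's worth of activations regardless of `accumulation_steps`.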