Why do we need to set the gradients manually to zero in pytorch?

Thanks a lot! I could not understand what accumulating the gradient meant.

Hi, I'm doing gradient accumulation this way, but I found that the loss increases.

    for batch_step, data in enumerate(dataloader):
        # loss is computed from data here (computation omitted in the post)
        g_loss = loss / step      # scale by the number of accumulation steps
        g_loss.backward()
        if (batch_step + 1) % step == 0:
            optimizerG.step()
            optimizerG.zero_grad()

But when I change it back to normal training like this, everything is fine. I don’t know why, please help!

    for batch_step, data in enumerate(dataloader):
        optimizerG.zero_grad()
        # loss is computed from data here (computation omitted in the post)
        g_loss = loss
        g_loss.backward()
        optimizerG.step()
            

@albanD assume that my GPU can only handle 1 sample per batch, and I want to optimize with batch size 32, so I will select option 2 with a mini-batch size of 1 and update the parameters when:

if (i + 1) % 32 == 0:
    opt.step()
    opt.zero_grad()

Normally, the gradient should be the average of the per-sample gradients over one batch. However, as far as I know, when I call loss.backward() 32 times, it accumulates the gradients of 32 samples, which means it is the sum of the gradients, not the average. Am I understanding this correctly? If yes, how can we get the average?

Hi,

It usually depends on how your loss is computed, indeed.
If your loss is the sum of the loss on each sample, then the gradients after a regular backward will be the sum.
If you take the mean of the loss on each sample, then you get the mean gradients as well.

In the example above, the “regular” gradient is the sum, so this other approach gives you the sum as well.
But if you want the mean, you can either keep your loss as a mean and divide by the number of batches, or set your loss to a sum and divide by the total number of samples contained in all of the batches used.
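
For example, a minimal sketch of option 2 with a mean-reduced loss (the names net, crit, opt and dataset follow the examples earlier in this thread, and accum_steps is just an assumed accumulation count): dividing each mini-batch loss by accum_steps gives you the mean gradient instead of the sum.

    # Minimal sketch, not from the original post: net, crit, opt and dataset
    # follow the earlier examples in this thread; accum_steps is assumed.
    accum_steps = 32

    for i, (input, target) in enumerate(dataset):
        pred = net(input)
        # crit is assumed to be mean-reduced over the mini-batch; dividing by
        # accum_steps makes the accumulated gradient the mean over all
        # accum_steps mini-batches rather than their sum
        loss = crit(pred, target) / accum_steps
        loss.backward()  # gradients are accumulated (summed) into the .grad buffers
        if (i + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()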

If I choose that approach, I would have to implement it as a third way, similar to your example, which would look like this:

# some code
# Initialize dataset with batch size 1
loss = 0
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    current_loss = crit(pred, target)
    # current graph is appended to the existing graph
    loss = loss + current_loss
    if (i + 1) % 64 == 0:
        opt.zero_grad()
        loss /= 64
        loss.backward()
        # huge graph is cleared here
        opt.step()
        loss = 0  # reset the accumulated loss so the freed graph is not reused

Am I understanding this correctly?

You don’t have to do that.
If you know in advance how many elements will be in the batch, you can divide each intermediary loss by that number to get the same behavior.

If you don’t know it in advance, you can compute the gradients efficiently as before and then rescale them afterwards:

with torch.no_grad():
    for p in net.parameters():
        p.grad.div_(batch_size)
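
For completeness, here is a rough sketch of that “rescale afterwards” version when the total number of samples is only known at the end (dataloader, net, crit and opt are assumed as in the earlier examples, and crit is assumed to be sum-reduced so the accumulated gradients are a plain sum):

    import torch

    n_samples = 0
    opt.zero_grad()
    for input, target in dataloader:
        pred = net(input)
        crit(pred, target).backward()   # accumulate summed gradients
        n_samples += input.size(0)

    with torch.no_grad():
        for p in net.parameters():
            if p.grad is not None:
                p.grad.div_(n_samples)  # rescale the summed gradient into a mean

    opt.step()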

Okay, I understand now, thank you for your help

Don’t we need retain_graph=True in cases 2 and 3, i.e. loss.backward(retain_graph=True)?

It should be mentioned that case 3 is notably faster than case 2 when running with DistributedDataParallel. This is no surprise, since the synchronization (the .backward() call) happens fewer times in case 3.

Hello @albanD, I read through all the posts in this thread but I don’t think I’ve found a similar answered question.

In your 2nd option, I understand that you are accumulating gradients, since we take a step (update the weights) every 10 iterations. Considering that crit(pred, target) will usually return the mean loss of the batch, shouldn’t we also do loss / 10 before calling loss.backward(), in order to make this behave as if there were no accumulation?

Thanks in advance and sorry for pinging after so long.

Yes, there is a constant factor here.
But since the learning rate is usually somewhat arbitrary anyway, it doesn’t change much.

But yes, you can divide each loss by the number of times they are accumulated if you need to.


Thanks a lot for answering! Have a nice day.

Right. But I think even in DDP mode you can still accumulate the gradients locally for a certain number of steps and then do the communication every N steps. This means case 2 can also be implemented, which, as analyzed before, uses much less memory.
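
A rough sketch of what I mean (hedged: ddp_model, dataloader, crit, opt and accum_steps are assumed names; no_sync() is DDP’s context manager for skipping gradient synchronization):

    # ddp_model is assumed to be the network wrapped in
    # torch.nn.parallel.DistributedDataParallel
    accum_steps = 8

    for i, (input, target) in enumerate(dataloader):
        loss = crit(ddp_model(input), target) / accum_steps
        if (i + 1) % accum_steps != 0:
            # skip the gradient all-reduce for the intermediate backwards
            with ddp_model.no_sync():
                loss.backward()
        else:
            loss.backward()   # this backward triggers the all-reduce
            opt.step()
            opt.zero_grad()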

@albanD Thanks for sharing your insights in this post. I’m not sure if I should open a new post or ask here, so I will just post my questions here.
I have questions regarding the second approach. The situation is that I’m pre-training a BERT-large model on phase 2 using DDP with 8 GPUs on a single node. As you know, the model is quite large, around 300M parameters. I cannot use a batch size of more than e.g. 24. I noticed that the communication is very slow on my machine, so that’s why I want to do more local gradient accumulations and then do the AllReduce. Say, for example, every 8 steps of local loss backward, I do one communication step. That means I now have a global batch size of 8 * 24 * 8 = 1536.
Here are my questions:

  1. Is this roughly equivalent to running with 8 nodes of 8 GPUs, each with batch size 24, and without any gradient accumulation? By equivalent I mean: for the same number of trained samples, should they reach a similar loss and MLM accuracy?
  2. How does it compare to 1 node, 8 GPUs with a 24*8 batch size (I know the GPU memory requirement would be huge, but this is just a hypothetical)? This is actually the same as comparing “case 1 and case 2” as mentioned in a post above. As I see it, if I run bs=24 for 8 times locally with gradient accumulation, each run only uses bs=24, so the variance is larger than using bs=192 directly. We accumulate the gradients to smooth out the variance, but each group of 24 samples is run separately, so I’m not sure whether this “smoothing at the end” is equivalent to “smoothing up front (with bs=192)”. I might not be expressing my thoughts well, apologies for that.
  3. Also related to this: should I adjust the learning rate? If for 1 node, 8 GPUs, bs=24 I use a learning rate of 1e-4, what should I use for 1 node, 8 GPUs, bs=24 with 8 accumulation steps? According to the square-root LR scaling rule, I think I should use 1e-4 * sqrt(8) ≈ 2.8e-4? (The arithmetic is sketched after this list.)
  4. By the way, I’m using APEX’s distributed fused LAMB optimizer, which does the DDP part internally. Does PyTorch plan to support a distributed fused LAMB optimizer in the future, btw? :grinning:
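
For reference, the square-root scaling arithmetic from question 3, using the values from the question; this only illustrates the calculation, not which scaling rule to pick:

    import math

    base_lr = 1e-4
    accum_steps = 8  # effective batch size grows by this factor

    scaled_lr = base_lr * math.sqrt(accum_steps)
    print(scaled_lr)  # ~2.83e-4, i.e. the 2.8e-4 mentioned above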

I wonder which line of torch.autograd.backward() implements the accumulation behavior?

Thanks for such a nice explanation.

Can you please share your thoughts on performing backpropagation after two (or more) feedforward passes?

Alternatively, what if loss = crit(pred, target) in case 2 is moved down under the if condition, and multiple pred outputs are collected first before the loss calculation?

I have a training batch of 64 and need at least 128 predictions before calculating the loss. The GPU I’m using doesn’t allow batch_size > 64.

Hi, I have a question about the third example code: would the loss be zero in the if condition?

Hello

@albanD and @ptrblck

For the 3rd option, accumulating the losses in GPU memory is not practical.
How about accumulating the losses in CPU memory and backpropagating using those accumulated losses? Is that possible?

I am trying to find an answer to my question on

Option 3 is more for completeness than anything else. It is both slower AND uses more memory than the two above. So you should use either 1 or 2 depending on the memory constraint you have.

Hi, it is so great to have your three examples! Could you also kindly explain, in the same way, the different results of calling net.zero_grad() inside the dataloader iteration when using iter_size?