Why do we need to set the gradients manually to zero in PyTorch?

Here are three equivalent code snippets, with different runtime/memory consumption.
Assume that you want to run SGD with an effective batch size of 100.
(I didn't run the code below, so there might be some typos; sorry in advance.)
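For context, here is a hypothetical setup for the snippets below; the concrete model, criterion, optimizer, and data are placeholders of my choosing, and any PyTorch module, loss, and optimizer would work the same way:

# hypothetical setup so the snippets below are self-contained
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

net = nn.Linear(20, 2)                            # placeholder model
crit = nn.CrossEntropyLoss()                      # placeholder criterion
opt = torch.optim.SGD(net.parameters(), lr=0.01)  # placeholder optimizer

# "dataset" stands for any iterable of (input, target) batches,
# e.g. a DataLoader built with the batch size mentioned in each case:
data = TensorDataset(torch.randn(1000, 20), torch.randint(0, 2, (1000,)))
dataset = DataLoader(data, batch_size=100)        # or 10 for cases 2 and 3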

1: single batch of 100 (least runtime, more memory)

# some code
# Initialize dataset with batch size 100
for input, target in dataset:
    pred = net(input)
    loss = crit(pred, target)
    # one graph is created here
    opt.zero_grad()
    loss.backward()
    # graph is cleared here
    opt.step()

2: multiple small batches of 10 (more runtime, least memory)

# some code
# Initialize dataset with batch size 10
opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    # one graph is created here
    loss.backward()
    # graph is cleared here
    if (i+1)%10 == 0:
        # every 10 iterations of batches of size 10
        opt.step()
        opt.zero_grad()

3: accumulate loss for multiple batches (more runtime, more memory)

# some code
# Initialize dataset with batch size 10
loss = 0
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    current_loss = crit(pred, target)
    # current graph is appended to existing graph
    loss = loss + current_loss
    if (i+1)%10 == 0:
        # every 10 iterations of batches of size 10
        opt.zero_grad()
        loss.backward()
        # huge graph is cleared here
        opt.step()
        # reset the accumulated loss so the next accumulation starts a
        # fresh graph (otherwise the next backward() would fail on the
        # already-freed graph)
        loss = 0

It should be clear that case 3 is not what you want: it keeps the graphs of all 10 small batches alive until the single backward() call, so it pays both the extra runtime and the extra memory.
The choice between case 1 and case 2 is a trade-off between memory and speed, so it depends on your constraints.
Note that if you can fit a batch size of 50 in memory, you can do a variation of case 2 with a batch size of 50 and an update every 2 iterations, as sketched below.
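Here is a minimal sketch of that variation, assuming the same placeholder net/crit/opt as above and a DataLoader created with batch_size=50:

# variation of case 2: batches of 50, one optimizer step every 2 iterations
# (effective batch size 100)
accum_steps = 2
opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    # optional: if crit averages over the batch (the default reduction for
    # most PyTorch losses), divide by accum_steps so the gradient scale
    # matches case 1:
    # loss = loss / accum_steps
    loss.backward()
    # gradients accumulate in the .grad buffers across iterations
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()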
