Visual explanation for zero_grad()

Does anybody have, or can anybody point me to, a visual explanation of why we call zero_grad() in the training loop? What does it mean when we say the gradients accumulate?

I cannot explain it visually, but “accumulating gradients” means that newly computed gradients are added to the gradients that are already stored.
After the first backward() call, each parameter's .grad attribute is set to the current gradient. You would use these gradients to update the parameters and zero them out afterwards.
However, some use cases (such as simulating a larger batch size) accumulate the gradients over several iterations and perform the update step later (see the sketches after the example below).
Here is a small example:

import torch
import torch.nn as nn

# setup
model = nn.Linear(1, 1, bias=False)
x = torch.randn(1, 1)

# check gradients (still None, since backward() has not been called yet)
print(model.weight.grad)

# fake training step to calculate gradients
out = model(x)
out.mean().backward()

# check gradient again (now contains the computed gradient)
print(model.weight.grad)

# usually you would now update the model via optimizer.step()
# and zero out the gradients before the next backward() call

# accumulate gradients: without zeroing, each backward() call adds to .grad
for _ in range(10):
    out = model(x)
    out.mean().backward()

    # check gradient (the value grows with every iteration)
    print(model.weight.grad)
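
To connect this back to the original question: in a standard training loop you call optimizer.zero_grad() once per iteration, so every optimizer.step() only uses the gradients of the current batch. A minimal sketch, assuming random fake batches and an SGD optimizer (neither is part of the example above):

import torch
import torch.nn as nn

model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    x = torch.randn(8, 1)     # fake batch (assumption for illustration)
    optimizer.zero_grad()     # clear the gradients accumulated in the previous iteration
    loss = model(x).mean()
    loss.backward()           # writes this batch's gradients into .grad
    optimizer.step()          # the update uses only the current batch's gradients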
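
And a sketch of the accumulation use case mentioned above, where the gradients of several small batches are summed before a single update step to simulate a larger batch size. The accumulation_steps value and the loss scaling are illustrative assumptions, not something the example above prescribes:

import torch
import torch.nn as nn

model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4

optimizer.zero_grad()
for i in range(20):
    x = torch.randn(2, 1)                        # small fake batch
    loss = model(x).mean() / accumulation_steps  # scale so the summed gradient matches one big batch
    loss.backward()                              # gradients are summed into .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                         # update with the accumulated gradients
        optimizer.zero_grad()                    # reset before the next accumulation window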