Zero_grad and learning rate

I am missing something here. If we zero out the current gradient values with optimizer.zero_grad(), then how does the update to the weights happen? For an update to happen, we should have gradient values for the weights so we can do something like weight -= lr * grad?


You call zero_grad() at the start of each mini-batch. If you do not, the gradients accumulate across batches. zero_grad() zeroes only the gradients, not the weights. After the forward pass, loss.backward() computes fresh gradients, and optimizer.step() then uses them to update the weights. If you want to accumulate gradients over several batches, skip the optimizer.zero_grad() call between them.
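To make the order of operations concrete, here is a minimal sketch with a single weight. The starting value, the loss 3 * w, and the learning rate are made up for illustration. It passes set_to_none=False so that .grad stays a zero tensor; in recent PyTorch versions zero_grad() sets .grad to None by default.

```python
import torch

# A single weight with a hand-picked value (hypothetical, for illustration only).
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# loss = 3 * w, so d(loss)/dw = 3 on every backward pass.
(3 * w).sum().backward()
grad_after_first = w.grad.clone()

# A second backward without zero_grad() adds to the stored gradient.
(3 * w).sum().backward()
grad_after_second = w.grad.clone()

# zero_grad() resets the gradient only; the weight keeps its value.
# set_to_none=False keeps a zero tensor instead of replacing .grad with None.
optimizer.zero_grad(set_to_none=False)
grad_after_zero = w.grad.clone()
weight_after_zero = w.detach().clone()

# Fresh backward, then step(): w <- w - lr * grad = 1.0 - 0.1 * 3
(3 * w).sum().backward()
optimizer.step()
weight_after_step = w.detach().clone()

print(grad_after_first, grad_after_second, grad_after_zero)
print(weight_after_zero, weight_after_step)
```

So the gradient that step() uses is the one produced by the backward() call that comes after zero_grad(), which is why zeroing first does not prevent the update.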

hmm, I see it is done before every backward call:

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

here data is one batch then?

Yes. As you can see, zero_grad() is called at the start of each iteration of the inner loop, that is, once per mini-batch.
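If it helps, here is a small sketch showing that each iteration over a DataLoader yields one mini-batch. The dataset here is random toy data and the batch size of 4 is an arbitrary choice; neither comes from the code in this thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 3 features each, plus 10 integer labels.
inputs = torch.randn(10, 3)
labels = torch.randint(0, 2, (10,))
loader = DataLoader(TensorDataset(inputs, labels), batch_size=4, shuffle=False)

batch_sizes = []
for i, data in enumerate(loader, 0):
    # data is a list of [inputs, labels] for one mini-batch
    batch_inputs, batch_labels = data
    batch_sizes.append(batch_inputs.shape[0])

print(batch_sizes)  # 10 samples in batches of 4: [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4 and the loader keeps the leftover samples by default.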