Is the gradient calculated based on the last forward operation?

Hi, I have a bad feeling about my code, although it does not raise any error.

    for idx, inputs in enumerate(data_loader):
        data_time.update(time.time() - end)
        bsz = inputs.size(0)

        config.STEP = config.STEP + 1
        inputs =
        # forward
        # TODO: update???
        x1, x2 = torch.split(inputs, [3, 3], dim=1)
        x1 = x1.cuda(non_blocking=True)
        x2 = x2.cuda(non_blocking=True)

        with torch.set_grad_enabled(config.TRAIN_BACKBONE):
            fea1 = net(x1)
            fea2 = net(x2)

I am calculating the loss based on both fea1 and fea2. Is that correct? Does fea1 contribute to the gradient, or only the last forward pass (fea2)?

Anything that is computed in a differentiable way and that contributes to the loss will also contribute to the computed gradients. So yes, both will participate in the gradients of the parameters of net.
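A quick way to convince yourself is to compare gradients with one branch detached. The sketch below is only an illustration: the small `net`, the random inputs, and the squared-difference loss are assumptions, not the code from the question. It checks that the full gradient is the sum of the part flowing through fea1 and the part flowing through fea2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the backbone `net` from the question.
net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))
x1 = torch.randn(6, 4)
x2 = torch.randn(6, 4)


def flat_grad(detach_fea1=False, detach_fea2=False):
    """Gradient of an (assumed) loss w.r.t. all parameters, flattened."""
    fea1 = net(x1)
    fea2 = net(x2)
    if detach_fea1:
        fea1 = fea1.detach()  # cut fea1 out of the autograd graph
    if detach_fea2:
        fea2 = fea2.detach()  # cut fea2 out of the autograd graph
    loss = (fea1 - fea2).pow(2).mean()  # assumed loss using both features
    grads = torch.autograd.grad(loss, list(net.parameters()))
    return torch.cat([g.flatten() for g in grads])


g_full = flat_grad()                       # both forward passes tracked
g_via_fea1 = flat_grad(detach_fea2=True)   # only the first forward pass
g_via_fea2 = flat_grad(detach_fea1=True)   # only the second forward pass

# The full gradient is the sum of the contributions from both passes.
sum_matches = torch.allclose(g_full, g_via_fea1 + g_via_fea2, rtol=1e-4, atol=1e-6)
# Dropping fea1 changes the gradient, so the first pass is not ignored.
fea1_matters = not torch.allclose(g_full, g_via_fea2, rtol=1e-4, atol=1e-6)
print(sum_matches, fea1_matters)
```

Note that in the original snippet, `torch.set_grad_enabled(config.TRAIN_BACKBONE)` disables tracking for both forward passes when the flag is False, in which case neither fea1 nor fea2 sends gradients back into net.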

@albanD Then can we split a large batch into N small batches and accumulate the results over N forward passes to increase the effective batch size? I haven't seen anyone increase the batch size this way; instead, people accumulate the backpropagated gradients.

You can do any combination, depending on what your constraints are. You can see this post for a more detailed description: Why do we need to set the gradients manually to zero in pytorch?
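To make the two options concrete, here is a minimal sketch under some assumptions: a toy model without batch-dependent layers such as BatchNorm, equally sized chunks, and a mean-reduced loss (all names here are made up for illustration). Option A accumulates the loss over N forward passes and calls backward once; option B calls backward after every chunk and lets the gradients accumulate in `.grad`. Under these assumptions both should match the gradient of the single large batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, chosen only for illustration.
net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x = torch.randn(12, 4)
y = torch.randn(12, 1)
criterion = nn.MSELoss()  # mean reduction over the batch
n_chunks = 3              # 12 samples -> 3 chunks of 4


def flat_grad():
    """Copy the current .grad of all parameters into one flat tensor."""
    return torch.cat([p.grad.flatten() for p in net.parameters()]).clone()


# Reference: a single forward/backward over the large batch.
net.zero_grad()
criterion(net(x), y).backward()
g_full = flat_grad()

# Option A: N forward passes, accumulate the loss, one backward.
# All N graphs stay alive until backward(), so this saves no memory.
net.zero_grad()
total = 0.0
for xc, yc in zip(x.chunk(n_chunks), y.chunk(n_chunks)):
    total = total + criterion(net(xc), yc) / n_chunks
total.backward()
g_loss_accum = flat_grad()

# Option B: backward after every chunk; gradients add up in .grad.
# Each chunk's graph is freed right after its backward().
net.zero_grad()
for xc, yc in zip(x.chunk(n_chunks), y.chunk(n_chunks)):
    (criterion(net(xc), yc) / n_chunks).backward()
g_grad_accum = flat_grad()

same_a = torch.allclose(g_full, g_loss_accum, rtol=1e-4, atol=1e-6)
same_b = torch.allclose(g_full, g_grad_accum, rtol=1e-4, atol=1e-6)
print(same_a, same_b)
```

This is probably why gradient accumulation (option B) is the one usually seen in practice: it gives the same gradient while only keeping one small graph in memory at a time. In a real training loop you would call `optimizer.step()` and zero the gradients once per N chunks.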


Thanks for answering. I found many interesting related discussions. :smiley: