Is gradient calculated based on the last forward operation?

Hi, I have a bad feeling about my code, although it does not raise any error.

    for idx, inputs in enumerate(data_loader):
        data_time.update(time.time() - end)
        bsz = inputs.size(0)

        config.STEP = config.STEP + 1
        inputs = inputs.to(config.DEVICE)
        # forward
        # TODO: update???
        x1, x2 = torch.split(inputs, [3, 3], dim=1)
        # .contiguous() is not in-place; assign the result back
        x1 = x1.contiguous()
        x2 = x2.contiguous()
        x1 = x1.cuda(non_blocking=True)
        x2 = x2.cuda(non_blocking=True)

        with torch.set_grad_enabled(config.TRAIN_BACKBONE):
            fea1 = net(x1)
            fea2 = net(x2)

I am calculating the loss based on fea1 and fea2. Is that correct? Does fea1 contribute to the gradient?

Anything that is computed in a differentiable way and that contributes to the loss will also contribute to the computed gradients. So yes, both will participate in the gradients of the parameters of net.
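
For example, here is a tiny self-contained check (the linear layer, shapes, and loss below are made up purely for illustration) showing that a loss built from two forward passes through the same module produces gradients in its shared parameters:

    import torch
    import torch.nn as nn

    net = nn.Linear(3, 2)        # toy stand-in for the backbone
    x1 = torch.randn(4, 3)
    x2 = torch.randn(4, 3)

    fea1 = net(x1)
    fea2 = net(x2)

    # a loss that depends on both forward passes
    loss = (fea1 - fea2).pow(2).mean()
    loss.backward()

    # net.weight.grad now contains contributions from both fea1 and fea2
    print(net.weight.grad)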

@albanD then can we split a large batch into N small batches and accumulate the results after N forward passes to increase the effective batch size? I haven't seen anyone increase the batch size that way; instead, they accumulate the back-propagated gradients.

You can do any combination, depending on what your constraints are. You can see this post for a more detailed description: Why do we need to set the gradients manually to zero in pytorch?
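
For reference, here is a minimal gradient-accumulation sketch in the spirit of that post (the model, loss, batch size, and `accumulation_steps` are placeholders chosen for illustration):

    import torch
    import torch.nn as nn

    net = nn.Linear(3, 2)                          # toy model
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    accumulation_steps = 4                         # N small batches per optimizer step

    optimizer.zero_grad()
    for step in range(20):
        x = torch.randn(8, 3)                      # one small batch
        loss = net(x).pow(2).mean()                # placeholder loss
        # divide so the accumulated gradient matches one big batch of 8 * N samples
        (loss / accumulation_steps).backward()

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()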


Thanks for answering. I found many interesting related discussions. :smiley: