Hi, I have a bad feeling about my code, although it does not return any error.
for idx, inputs in enumerate(data_loader):
    data_time.update(time.time() - end)
    bsz = inputs.size(0)
    config.STEP = config.STEP + 1
    inputs = inputs.to(config.DEVICE)
    # TODO: update???
    x1, x2 = torch.split(inputs, [3, 3], dim=1)
    x1 = x1.cuda(non_blocking=True)
    x2 = x2.cuda(non_blocking=True)
    fea1 = net(x1)
    fea2 = net(x2)
I am calculating the loss based on fea2. Is that correct? Does fea1 contribute to the gradient?
Anything that is computed in a differentiable way and that contributes to the loss will also contribute to the computed gradients. So yes, both will participate in the gradients of the parameters of the net.
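A minimal sketch of this, using a tiny stand-in model (the names here are illustrative, not the thread's actual `net`): when the loss uses both features, both forward passes contribute to the parameter gradients; detaching one branch removes its contribution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for `net` from the thread (illustrative only).
net = nn.Linear(4, 2)

x1 = torch.randn(8, 4)
x2 = torch.randn(8, 4)

fea1 = net(x1)
fea2 = net(x2)

# A loss that uses both features: both forward passes contribute to the gradient.
(fea1 - fea2).pow(2).mean().backward()
g_both = net.weight.grad.clone()

# If fea1 is detached from the graph, only the fea2 branch contributes.
net.weight.grad = None
(net(x1).detach() - net(x2)).pow(2).mean().backward()
g_only2 = net.weight.grad.clone()
```

Comparing `g_both` and `g_only2` shows they differ: the difference is exactly the part of the gradient that flowed through `fea1`.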
@albanD then can we split a large batch into N small batches and accumulate the results over N forwards to increase our effective batch size? I haven't seen anyone increase the batch size that way; instead, they accumulate the back-propagated gradients.
You can do any combination, depending on what your constraints are. You can see this post for a more detailed description: Why do we need to set the gradients manually to zero in pytorch?
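The gradient-accumulation variant mentioned above can be sketched as follows (a minimal example with illustrative names, not code from the thread): run N small forward/backward passes before a single optimizer step, scaling each micro-batch loss by 1/N so the accumulated gradient matches the full-batch gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative model and optimizer, not from the thread.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

big_x = torch.randn(32, 10)
big_y = torch.randn(32, 1)
N = 4  # split the batch of 32 into N micro-batches of 8

# Reference: gradient from one full-batch forward/backward.
opt.zero_grad()
((model(big_x) - big_y) ** 2).mean().backward()
g_full = model.weight.grad.clone()

# Accumulation: N small forward/backward passes before a single step.
opt.zero_grad()
for xb, yb in zip(big_x.chunk(N), big_y.chunk(N)):
    loss = ((model(xb) - yb) ** 2).mean() / N  # rescale so the sum matches the full-batch mean
    loss.backward()  # .grad accumulates across the N backward passes
g_accum = model.weight.grad.clone()

opt.step()  # one parameter update, as if the full batch had been used
```

Because `.grad` buffers are summed across `backward()` calls (and only cleared by `zero_grad()`), the accumulated gradient equals the full-batch gradient up to floating-point error.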
Thanks for answering. I found many interesting related discussions.