Hey, I wanted to ask whether there is a difference between feeding a whole batch to the model and then updating the weights, versus feeding the samples one at a time, summing up the losses, and updating once per batch. I am aware that this also depends on the criterion used, but can I do the following in practice? (My GPU memory is too small to train with big batches, and I would like to know if this is a workaround.)
# Variant 1: feed the whole batch at once
batch, labels = load_batch()
out = model(batch)
loss = criterion(out, labels)
loss.backward()
optimizer.step()

# Variant 2: feed samples one by one, accumulate the loss, update once
loss = 0
for data, label in zip(batch, labels):
    out = model(data.unsqueeze(0))
    loss += criterion(out, label.unsqueeze(0))
loss.backward()
optimizer.step()
If you use reduction='sum' in your criterion, the gradients should be the same:
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)
criterion = nn.NLLLoss(reduction='sum')

x = torch.randn(2, 10)
y = torch.empty(2, dtype=torch.long).random_(2)

# use whole batch (NLLLoss expects log-probabilities, hence log_softmax)
loss = criterion(F.log_softmax(model(x), dim=1), y)
loss.backward()
model_grad_batch = model.weight.grad.clone()
print(model_grad_batch)

# use separate samples
model.zero_grad()
loss = 0
for idx in range(x.size(0)):
    out = F.log_softmax(model(x[idx].unsqueeze(0)), dim=1)
    loss += criterion(out, y[idx].unsqueeze(0))
loss.backward()
model_grad_sep = model.weight.grad.clone()
print(model_grad_sep)

print(torch.allclose(model_grad_batch, model_grad_sep))
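Not from the original thread, but as a sketch of the other common case: with reduction='mean' the two approaches only match if you scale each per-sample loss by 1/N, since the batch mean divides by the batch size once. The model, criterion, and sizes below are just illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss(reduction='mean')
x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

# whole batch: loss is the mean over 4 samples
loss = criterion(model(x), y)
loss.backward()
grad_batch = model.weight.grad.clone()

# separate samples: divide each loss by N so the sum equals the batch mean
model.zero_grad()
n = x.size(0)
for idx in range(n):
    loss = criterion(model(x[idx:idx + 1]), y[idx:idx + 1]) / n
    loss.backward()
grad_sep = model.weight.grad.clone()

print(torch.allclose(grad_batch, grad_sep))
```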
Some layers like nn.BatchNorm will behave differently, since their running statistics are updated on every forward pass, so maybe you could try to tune the momentum etc.
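One more note on the memory concern: summing the losses before a single backward keeps every sample's computation graph alive, so it doesn't actually save much GPU memory. Calling backward inside the loop frees each graph immediately while the gradients still accumulate in .grad. A minimal sketch, assuming an illustrative model, criterion, and a hypothetical micro-batch size that fits in memory:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss(reduction='sum')

batch_x = torch.randn(8, 10)
batch_y = torch.randint(0, 2, (8,))

micro = 2  # micro-batch size that fits in memory (hypothetical)
optimizer.zero_grad()
for i in range(0, batch_x.size(0), micro):
    out = model(batch_x[i:i + micro])
    loss = criterion(out, batch_y[i:i + micro])
    # backward here frees this micro-batch's graph;
    # gradients accumulate into param.grad across iterations
    loss.backward()
optimizer.step()  # one update for the full effective batch
```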
One small question: I set optimizer.zero_grad() before each new batch, right?
Yes, before the backward call.
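To make the placement concrete, here is a minimal training-loop sketch (illustrative model and data, not from the thread): zero the gradients before each batch's backward so that .grad only holds the current batch's contribution.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for step in range(3):
    x = torch.randn(4, 10)
    y = torch.randint(0, 2, (4,))
    optimizer.zero_grad()           # clear leftover grads from the last step
    loss = criterion(model(x), y)
    loss.backward()                 # .grad now holds only this batch's grads
    optimizer.step()
```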