[SOLVED] Gradient Descent and Batches

Hey, I wanted to ask if there is a difference between feeding a whole batch to the model and then updating the weights, versus feeding the samples one at a time, summing up the losses, and updating after a batch-sized number of samples. I am aware that this also depends on the criterion used, but can I do the following in practice? (My GPU memory is too small to train with big batches and I would like to know if this is a workaround.)

# Option 1: forward the whole batch at once
batch, label = load_batch()
out = model(batch)
loss = criterion(out, label)
loss.backward()
optimizer.step()

# Option 2: forward one sample at a time and sum the losses
loss = 0
for i, data in enumerate(batch):
    out = model(data.unsqueeze(0))
    loss += criterion(out, label[i].unsqueeze(0))
loss.backward()
optimizer.step()

If you use reduction='sum' in your criterion, the gradients should be the same:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# NLLLoss usually expects log-probabilities; the raw linear output is
# sufficient for comparing the gradients here
criterion = nn.NLLLoss(reduction='sum')

x = torch.randn(2, 10)
y = torch.empty(2, dtype=torch.long).random_(2)

# use whole batch
loss = criterion(model(x), y)
loss.backward()
model_grad_batch = model.weight.grad.clone()
print(model_grad_batch)

# use separate samples
model.zero_grad()

loss = 0
for idx in range(x.size(0)):
    loss += criterion(model(x[idx].unsqueeze(0)), y[idx].unsqueeze(0))
loss.backward()
model_grad_sep = model.weight.grad.clone()
print(model_grad_sep)

print(torch.allclose(model_grad_batch, model_grad_sep))  # True
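
Note that summing the losses and calling backward() only once keeps the computation graphs of all samples alive until the backward call, so it won't reduce peak memory by much. If the goal is to work around limited GPU memory, a minimal sketch of gradient accumulation could look like this; it calls backward() inside the loop, which frees each sample's graph right away and adds the gradients up in .grad. load_batch, model, criterion, and optimizer are the placeholders from your snippet above:

# Gradient accumulation sketch: per-sample backward, one optimizer step
optimizer.zero_grad()
batch, label = load_batch()
for i, data in enumerate(batch):
    out = model(data.unsqueeze(0))
    # with reduction='sum' the accumulated gradients match the full-batch gradients
    loss = criterion(out, label[i].unsqueeze(0))
    loss.backward()   # gradients are added to the existing .grad buffers
optimizer.step()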

Some layers like nn.BatchNorm will behave differently, since their batch statistics are computed from whatever is in the current forward pass, so you might have to tune the momentum etc.
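
For example, here is a small sketch with made-up random data just to illustrate that the normalization depends on which other samples are in the batch:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
bn.train()

x = torch.randn(4, 3)
out_full = bn(x)       # normalized with the statistics of all 4 samples
out_part = bn(x[:2])   # normalized with the statistics of 2 samples only

# the first two rows generally differ, because the batch statistics differ
print(torch.allclose(out_full[:2], out_part))  # usually False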


One small question: I should call optimizer.zero_grad() before each new batch, right?

Yes, before the backward call.
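
For reference, a minimal sketch of where the calls usually go in a training loop (dataloader, model, criterion, and optimizer are placeholders):

for batch, label in dataloader:
    optimizer.zero_grad()        # clear gradients from the previous batch
    out = model(batch)
    loss = criterion(out, label)
    loss.backward()              # compute new gradients
    optimizer.step()             # update the weights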