Hey, I wanted to ask if there is a difference between feeding a whole batch to the model and then updating the weights, versus feeding single samples to the model, summing up the losses, and updating after one batch's worth of samples. I am aware that this also depends on the criterion used, but can I do the following in practice? (My GPU memory is too small to train with big batches, and I would like to know if this is a workaround.)

```
batch, label = load_batch()
out = model(batch)
loss = criterion(out, label)
loss.backward()
optimizer.step()
```

```
loss = 0
for data in batch:
    out = model(data.unsqueeze(0))
    loss += criterion(out, label)
loss.backward()
optimizer.step()
```

If you use `reduction='sum'` in your criterion, the gradients should be the same:

```
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.NLLLoss(reduction='sum')
x = torch.randn(2, 10)
y = torch.empty(2, dtype=torch.long).random_(2)
# use whole batch
loss = criterion(model(x), y)
loss.backward()
model_grad_batch = model.weight.grad.clone()
print(model_grad_batch)
# use separate samples
model.zero_grad()
loss = 0
for idx in range(x.size(0)):
    loss += criterion(model(x[idx].unsqueeze(0)), y[idx].unsqueeze(0))
loss.backward()
model_grad_sep = model.weight.grad.clone()
print(model_grad_sep)
torch.allclose(model_grad_batch, model_grad_sep)
```
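If you stick with the default `reduction='mean'` instead, you could scale each per-sample loss by the number of samples to mimic the batch mean; a minimal sketch (variable names like `grad_batch` are just for illustration):

```
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.NLLLoss()  # default reduction='mean'
x = torch.randn(2, 10)
y = torch.empty(2, dtype=torch.long).random_(2)

# use whole batch
loss = criterion(model(x), y)
loss.backward()
grad_batch = model.weight.grad.clone()

# use separate samples, scaled by 1/N
model.zero_grad()
loss = 0
for idx in range(x.size(0)):
    # divide by the number of samples to mimic the batch mean
    loss += criterion(model(x[idx].unsqueeze(0)), y[idx].unsqueeze(0)) / x.size(0)
loss.backward()
grad_sep = model.weight.grad.clone()

print(torch.allclose(grad_batch, grad_sep))
```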

Some layers like `nn.BatchNorm` will behave differently, since their running statistics are updated on each forward pass, so maybe you could try to tune the `momentum` etc.
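For example, a small sketch of lowering the momentum on all batch norm layers (the value `0.01` here is just an illustration, not a recommendation; the PyTorch default is `0.1`):

```
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# A smaller momentum makes the running mean/var update more slowly,
# which can smooth out the noisy statistics from single-sample batches.
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.01  # default is 0.1
```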


One small question: I set `optimizer.zero_grad()` before each new batch, right?

Yes, before the `backward` call.
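Putting it all together, a minimal sketch of one accumulated update with the `zero_grad` call in place (the data here is just random placeholder input):

```
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.NLLLoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)        # stands in for one "big" batch
y = torch.randint(0, 2, (4,))

optimizer.zero_grad()         # clear stale gradients before the backward call
loss = 0
for idx in range(x.size(0)):  # feed one sample at a time
    loss += criterion(model(x[idx].unsqueeze(0)), y[idx].unsqueeze(0))
loss.backward()               # one backward for the accumulated loss
optimizer.step()              # one update for the whole batch
```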