I am working on a regression problem involving a very sparse dataset (<1% non-zero examples). One of the strategies I would like to try is to accumulate the gradient for each batch over an entire epoch, and only then update the model (since a batch has many zero examples and the gradients quickly tend to zero). I am thinking of a pseudocode similar to the following:

```
for e in epochs:
    grad = 0
    for batch in dataloader:
        grad += model.backward(batch)
    grad /= len(dataloader)
    model.update(grad)
```

My current code that updates the gradient after every batch is:

```
for e in epochs:
    for batch in dataloader:
        # Reset the gradients left over from the previous batch
        optimizer.zero_grad()
        # Forward pass
        x: torch.Tensor = batch[0].float().to(device)
        y: torch.Tensor = batch[1].float().to(device)
        y_pred = model_reg(x).squeeze()
        loss = criterion(y_pred, y)
        # Backward pass
        loss.backward()
        # Update the parameters after every batch
        optimizer.step()
```

Is there a way I could do what I am thinking of in PyTorch? I know autograd is a bit special, so I wanted to seek your wisdom in these matters. My current idea is to move optimizer.zero_grad() and optimizer.step() outside the batch loop, so that each is called only once per epoch. However, I am not sure how I could divide the accumulated gradient by the number of batches at the end of an epoch.
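To make the idea concrete, here is a minimal sketch of what I imagine, on toy data. It relies on my understanding that loss.backward() adds into each parameter's .grad rather than overwriting it, and it handles the division by scaling every batch loss by 1 / len(dataloader) before calling backward(). I am assuming this is equivalent to dividing the summed gradient afterwards. The tensors x_all and y_all, the linear model, and the hyperparameters are placeholders I made up for illustration, not my real setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: 64 examples, only a few non-zero targets
x_all = torch.randn(64, 4)
y_all = torch.zeros(64)
y_all[::16] = 1.0
dataloader = DataLoader(TensorDataset(x_all, y_all), batch_size=8)

# Placeholder model, loss, and optimizer
model_reg = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model_reg.parameters(), lr=0.1)

num_epochs = 3
for e in range(num_epochs):
    # Clear the gradients once, at the start of the epoch
    optimizer.zero_grad()
    for x, y in dataloader:
        y_pred = model_reg(x).squeeze(-1)
        # Scale each batch loss so the accumulated gradient is the
        # average over batches instead of the sum
        loss = criterion(y_pred, y) / len(dataloader)
        # backward() accumulates into .grad across batches
        loss.backward()
    # Single parameter update per epoch
    optimizer.step()
```

If this is right, then with equally sized batches and a mean-reduced loss the accumulated gradient should match the gradient of the loss over the full dataset, which is effectively full-batch gradient descent computed in chunks that fit in memory.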

I have been looking through the documentation and past topics and couldn’t find an answer. Any advice or help is greatly appreciated!

Thanks,

Alex