I am working on a regression problem involving a very sparse dataset (<1% non-zero examples). One of the strategies I would like to try is to accumulate the gradients across all batches of an entire epoch and only then update the model (since each batch contains many zero examples, the per-batch gradients quickly tend to zero). I am thinking of pseudocode similar to the following:
```python
for e in epochs:
    grad = 0
    for batch in dataloader:
        grad += model.backward(batch)
    grad /= len(dataloader)
    model.update(grad)
```
My current code that updates the gradient after every batch is:
```python
for e in epochs:
    for batch in dataloader:
        # Forward pass
        optimizer.zero_grad()
        x, y = batch
        x = x.float().to(device)
        y = y.float().to(device)
        y_pred = model_reg(x).squeeze()
        loss = criterion(y_pred, y)
        # Backward pass
        loss.backward()
        optimizer.step()
```
Is there a way I could do what I am thinking of in PyTorch? I know autograd is a bit special, so I wanted to seek your wisdom in these matters. My current idea is to move `optimizer.zero_grad()` and `optimizer.step()` outside the batch loop, so each is called only once per epoch. However, I am not sure how I could divide the accumulated gradient by the number of batches at the end of an epoch.
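For what it's worth, here is a minimal runnable sketch of that idea as I currently understand it. The model, data, and hyperparameters are toy stand-ins for my real `model_reg` and `dataloader`; the key trick (which I am not sure is the idiomatic way) is scaling each batch loss by the number of batches before calling `backward()`, so the gradients that accumulate in `param.grad` end up being the average batch gradient:

```python
import torch
import torch.nn as nn

device = "cpu"

# Toy stand-ins for my real model, loss, and optimizer
model_reg = nn.Linear(10, 1).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model_reg.parameters(), lr=0.01)

# Dummy data standing in for the sparse dataloader: 5 batches of 4 examples
data = [(torch.randn(4, 10), torch.randn(4)) for _ in range(5)]

epochs = 3
for e in range(epochs):
    optimizer.zero_grad()  # clear gradients once per epoch
    for x, y in data:
        x, y = x.float().to(device), y.float().to(device)
        y_pred = model_reg(x).squeeze()
        loss = criterion(y_pred, y)
        # Dividing each loss by the number of batches means the
        # gradients that sum up in .grad equal the average batch gradient
        (loss / len(data)).backward()
    optimizer.step()  # single parameter update per epoch
```

Since `backward()` adds into `.grad` rather than overwriting it, this should be equivalent to summing the per-batch gradients and dividing by `len(dataloader)` at the end, without touching `.grad` manually.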
I have been looking through the documentation and past topics and couldn’t find an answer. Any advice or help is greatly appreciated!