Implementation of Stochastic Average Gradient

Hey, I'm trying to implement the optimizer from the paper “Minimizing Finite Sums with the Stochastic Average Gradient” (Algorithm 1).

In general, the loss function is the sum/mean of individual loss functions: each data point corresponds to its own loss function, and we sum/mean over the data points.
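Concretely, I mean the usual finite-sum objective (my own notation, not quoted from the paper):

g(w) = (1/n) * sum_{i=1..n} f_i(w),   where f_i(w) is the loss on data point i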

In this optimizer, we need to store the most recent gradient of each individual loss function.
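In PyTorch terms I picture that memory as one buffer slot per data point, something like the sketch below (my own names; N is the number of data points and weight is a single parameter tensor):

import torch

# one slot per data point: the most recently seen gradient of f_i w.r.t. weight
grad_memory = torch.zeros(N, *weight.shape)   # memory cost: N copies of the parameter
grad_sum = torch.zeros_like(weight)           # running sum of the rows of grad_memory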

I'm wondering if there is an efficient way of implementing this optimizer.
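To make the question concrete, here is my rough reading of one step of Algorithm 1 for a single parameter tensor, using the buffers above (only a sketch: the toy squared loss, the learning rate lr, and the sampled index i are my own placeholders, and weight is assumed to have requires_grad=True):

import torch

def sag_step(weight, grad_memory, grad_sum, x, y, lr, i):
    # fresh gradient of the i-th individual loss (toy squared loss, just for the sketch)
    loss_i = (x[i] @ weight - y[i]).pow(2)
    g_i = torch.autograd.grad(loss_i, weight)[0]

    # replace the stored gradient for sample i and keep the running sum consistent
    grad_sum += g_i - grad_memory[i]
    grad_memory[i] = g_i

    # step along the average of all stored gradients
    with torch.no_grad():
        weight -= lr * grad_sum / grad_memory.shape[0]

In the paper the index i is drawn uniformly at random at each step; I've left the sampling and the outer loop out of the sketch.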

Currently, to get the per-sample gradients, I have:

loss = (y_pred - y).pow(2)                            # loss is a tensor with shape (N,)
grad = torch.autograd.grad(loss.unbind(0), [weight])  # unbind along dim 0 gives N per-sample (scalar) losses

However, I was expecting grad to have shape (N, *gradient_shape), i.e. one gradient per individual loss, but the result is summed over all N. Is there any way to fix this?
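The only workaround I've found is to loop over the per-sample losses and call autograd once per sample, which seems wasteful for large N (again, just my current attempt):

# one backward pass per sample; retain_graph so the graph survives repeated calls
per_sample_grads = torch.stack(
    [torch.autograd.grad(l, weight, retain_graph=True)[0] for l in loss]
)  # shape (N, *weight.shape)

This gives the shape I expect, but it does N separate backward passes, so I'd like to know if there is a vectorized alternative.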