Hey, here is an optimizer from the paper "Minimizing Finite Sums with the Stochastic Average Gradient" (Algorithm 1).

In general, the loss function is the sum/mean of individual loss functions: each data point corresponds to one loss function, and we sum/mean over the data points.

In this optimizer, we need to store the most recent gradient of each individual loss function.

I'm wondering if there is an efficient way of implementing this optimizer.
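For reference, here is a minimal sketch of the SAG update as I understand Algorithm 1, for a linear least-squares model: keep a table of the most recently seen gradient for each data point, refresh one entry per step, and step along the average of the stored gradients. The running sum makes each step O(d) instead of O(Nd). The function name `sag_step` and the model choice are mine, not from the paper.

```python
import torch

def sag_step(w, X, y, grad_memory, grad_sum, lr):
    """One SAG step: sample an index, refresh its stored gradient,
    then move along the average of all stored gradients.

    grad_memory: (N, d) tensor of the last-seen per-sample gradients
    grad_sum:    (d,) tensor holding grad_memory.sum(dim=0)
    """
    N = X.shape[0]
    i = torch.randint(N, (1,)).item()
    # Gradient of the i-th loss (x_i @ w - y_i)**2 w.r.t. w.
    g_new = 2.0 * (X[i] @ w - y[i]) * X[i]
    # Maintain the running sum in O(d) by swapping old entry for new.
    grad_sum += g_new - grad_memory[i]
    grad_memory[i] = g_new
    # Step along the average of the N stored gradients.
    return w - lr * grad_sum / N
```

Both `grad_memory` and `grad_sum` start at zero (the paper also discusses initializing the memory with a full gradient pass); for a non-linear model the line computing `g_new` is exactly where you would plug in per-sample autograd gradients.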

Currently I have:

```
loss = (y_pred - y).pow(2)  # loss is a tensor with shape (N,)
# loss.split(1) splits the (N,) tensor into N one-element losses
# (split(0) raises an error); but autograd.grad SUMS the gradients
# over all outputs instead of returning one gradient per loss
grad = torch.autograd.grad(loss.split(1), [weight])
```

However, I was expecting `grad` to have shape `(N, *gradient_shape)`, i.e. one gradient per loss; but the result shows that it sums over all `N`. Is there any way to fix this?