Is SGD doc formula for a single sample or a mini-batch?

Is the formula presented at SGD — PyTorch 2.2 documentation for each sample or for a mini-batch? I ask because I want to know whether, for a mini-batch of size n, the formula used internally would be g_t ← g_t + nλθ_{t−1} instead of g_t ← g_t + λθ_{t−1}.

Thanks

Hi Vincent!

It’s neither. SGD is unaware of your batch size or even whether you used
a batch at all.

All SGD does is update your parameters with a gradient-descent step based
on the .grad attributes of those parameters. It knows nothing about how
those .grads were populated. They could come from backpropagating a single
sample, from a batch of samples, from accumulating gradients across multiple
backpropagations by not calling .zero_grad() in between, or from setting
.grad manually according to some scheme. In particular, the weight-decay
term λθ_{t−1} in the documentation's pseudocode is added to whatever
gradient SGD finds, once per optimizer step, so there is no factor of n.
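You can check this directly by setting .grad by hand. Here is a minimal sketch (the parameter values, lr, and weight_decay below are arbitrary illustrative choices) confirming that plain SGD applies θ ← θ − lr · (g + λθ) exactly once per step, with no dependence on any batch size:

```python
import torch

lr, wd = 0.1, 0.01
theta = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
opt = torch.optim.SGD([theta], lr=lr, weight_decay=wd)

# Fill .grad by hand -- SGD never sees a batch, a loss, or a sample count.
g = torch.tensor([0.5, 0.25])
theta.grad = g.clone()

theta_before = theta.detach().clone()
opt.step()

# Weight decay is added once per optimizer step: theta - lr * (g + wd * theta)
expected = theta_before - lr * (g + wd * theta_before)
print(torch.allclose(theta.detach(), expected))  # True
```

If you average your loss over a batch of size n, the .grad you hand to SGD is already the per-batch average, and the λθ_{t−1} term is still added just once.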

Best.

K. Frank
