Is the formula presented at SGD — PyTorch 2.2 documentation for each sample or for a mini-batch? I ask because I want to know if, for a mini-batch of size n, the formula used internally would be g_t ← g_t + nλθ_{t−1} instead of g_t ← g_t + λθ_{t−1}.
It’s neither. SGD is unaware of your batch size or even whether you used
a batch at all.
All SGD does is update your parameters with a gradient-descent step based
on the .grad attributes of those parameters. It knows nothing about how
those .grads were set: they could come from backpropagating a single
sample, from a batch of samples, from accumulating gradients across multiple
backward passes by not calling .zero_grad() in between, or from setting .grad manually according to some scheme of your own. In every case the weight-decay term is added once per call to step(), so the update uses g_t ← g_t + λθ_{t−1} with no factor of n.
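A minimal sketch (not part of the original answer; the dummy loss, tensor names, and values are purely illustrative) to make this concrete: whether .grad comes from a backward pass or is written by hand, SGD performs the same step, and the weight_decay term is added exactly once per step() call.

```python
import torch

torch.manual_seed(0)

lr, lam = 0.1, 0.01
w0 = torch.randn(3)

# Case A: .grad set by backpropagating a (dummy) loss, as after a mini-batch.
w_a = w0.clone().requires_grad_(True)
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=lam)
(w_a ** 2).sum().backward()      # gradient is 2 * w0
opt_a.step()

# Case B: .grad written by hand with the same values; SGD cannot tell the difference.
w_b = w0.clone().requires_grad_(True)
opt_b = torch.optim.SGD([w_b], lr=lr, weight_decay=lam)
w_b.grad = 2 * w0                # same gradient as case A, set manually
opt_b.step()

# Both steps compute w0 - lr * (grad + lam * w0): the decay term appears once,
# with no batch-size factor.
print(torch.allclose(w_a, w_b))                             # True
print(torch.allclose(w_a, w0 - lr * (2 * w0 + lam * w0)))   # True
```

The second check just confirms that the step matches the documented update θ_t = θ_{t−1} − γ(g_t + λθ_{t−1}), with γ the learning rate.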