My understanding is that SGD conceptually means running gradient descent on mini-batches that you sample randomly from the dataset (hence, the “stochastic” in the name). In all examples I’ve seen, the task of sampling data and shuffling the dataset is done independently of the optimizer used. If I always use my entire dataset and never shuffle it, calling optimizer.step() for SGD would essentially do regular gradient descent, right? Is there some parameter in SGD that actually makes it deserve the name “stochastic” that I’m missing?
albanD (Alban D) #2
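No, there is no such parameter. The only stochastic part is the sampling of a subset of the whole dataset. The optimizer just applies the update to whatever gradients it is given, so if you always feed it the full, unshuffled dataset, you get regular gradient descent.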
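To make this concrete, here is a minimal plain-Python sketch (not PyTorch itself). The names `grad_mse`, `step`, and `train` are hypothetical helpers written for this illustration, and the toy dataset and learning rate are assumptions. `step` plays the role of `optimizer.step()` and contains no randomness; the only random call is the shuffle in the data-sampling loop.

```python
import random

def grad_mse(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def step(w, grad, lr=0.005):
    """Stand-in for the optimizer update: purely deterministic."""
    return w - lr * grad

def train(data, batch_size, shuffle, seed, epochs=50):
    """Hypothetical training loop; randomness lives only in the sampling."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        order = list(range(len(data)))
        if shuffle:
            rng.shuffle(order)  # the only stochastic operation in the loop
        for i in range(0, len(order), batch_size):
            batch = [data[j] for j in order[i:i + batch_size]]
            w = step(w, grad_mse(w, batch))
    return w

# Toy data (an assumption for this sketch): y = 3x plus a small alternating offset.
data = [(float(x), 3.0 * x + (0.5 if x % 2 else -0.5)) for x in range(1, 9)]

# Full dataset, no shuffling: the seed is never used, so runs are identical.
full_a = train(data, batch_size=len(data), shuffle=False, seed=0)
full_b = train(data, batch_size=len(data), shuffle=False, seed=1)
print("full batch:", full_a, full_b)

# Shuffled mini-batches: the result now depends on the sampling seed.
mini = [train(data, batch_size=2, shuffle=True, seed=s) for s in range(5)]
print("mini-batch:", mini)
```

With the full dataset and no shuffling, the same update rule gives the same result on every run, which is ordinary gradient descent. Switching to shuffled mini-batches changes nothing in `step`; only the batches it receives differ.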