Is the SGD in Pytorch a real SGD?

Wmog · November 9, 2017, 12:31pm

After looking at the code of PyTorch’s SGD, it seems (apologies in advance if I am wrong) that it is not a real SGD. The way gradients are accumulated, and especially the order in which they are accumulated, is up to the user. So where is the randomness, the stochastic character, of this gradient descent?

According to Wikipedia’s article on SGD, the examples in the training set must be randomly shuffled after being used to compute the gradient. Here, however, this step is carried out by the user, who accumulates the gradient with backward().

BartolomeD · November 9, 2017, 12:38pm

I think the stochastic character of SGD is that it approximates the true loss surface of the entire dataset by optimizing parameters based on samples of data (i.e. batches).

Wmog · November 9, 2017, 12:42pm

Yes, but this approximation and the construction of the samples (and hence the stochastic character) are up to the user, aren’t they? If so, then it’s more a “GD” than an “SGD”, no?

BartolomeD · November 9, 2017, 12:46pm

In the end it is up to the user, I agree. If one passed all the data through the model and only then updated the weights with SGD, it would be ordinary gradient descent, I would say.
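To make the full-batch case concrete, here is a minimal sketch in plain Python (not PyTorch). It assumes a made-up one-parameter model `w * x` with squared error, and `mean_grad` is a hypothetical helper, not a library function. It suggests that when the whole dataset is used for every update, shuffling changes nothing, whereas different mini-batches give different gradients.

```python
import random

def mean_grad(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy dataset (an assumption for illustration): y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0

# Shuffle a copy of the data with a fixed seed
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)

# Full batch: the order of the examples does not matter
full_original = mean_grad(w, data)      # -30.0
full_shuffled = mean_grad(w, shuffled)  # same value

# Mini-batches: the gradient depends on which examples land in the batch
batch_a = mean_grad(w, data[:2])  # -10.0
batch_b = mean_grad(w, data[2:])  # -50.0
```

With a full batch the update is ordinary gradient descent, whatever the order of the data. Any stochastic behaviour comes from how the mini-batches are chosen.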

Wmog · November 9, 2017, 2:05pm

Thank you for your answer. OK, so if I understood correctly, it is more a user-specific gradient descent than an SGD. I think the name of this class is therefore somewhat confusing.

albanD · November 9, 2017, 2:08pm

Hi,
The thing is that, by this argument, all optimizer names are confusing.
The optimizers provided in PyTorch actually implement the update steps of the algorithms they are named after.
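As a rough illustration of what “implementing the update step” means, here is a minimal sketch in plain Python. The function `sgd_step` is a hypothetical stand-in, not PyTorch’s actual implementation, and it assumes no momentum or weight decay. Given the same parameters and gradients it always returns the same result, so it contains no randomness of its own.

```python
def sgd_step(params, grads, lr):
    """Return the parameters after one plain update: p - lr * g for each pair."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [1.0, -2.0]
grads = [0.5, 4.0]  # assumed to have been computed elsewhere, e.g. by backpropagation

# Calling the step twice on the same inputs gives identical outputs
first = sgd_step(params, grads, lr=0.1)
second = sgd_step(params, grads, lr=0.1)
```

Whether the gradients passed in come from one example, a mini-batch, or the whole dataset is decided entirely by the caller.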

Wmog · November 9, 2017, 4:48pm

OK, perfect, that was exactly what I thought. Actually, they should be named “Stepper”. For example, SGD would become “SGDStepper”. That seems clearer.

K_Frank · March 2, 2019, 8:39pm

Hello Wmog (and others) -

I had the exact same question / confusion, so I’m glad I found this
thread. (Sorry to revive an oldish thread …)

I agree with your analysis. torch.optim.SGD doesn’t have any
randomness (stochasticity) in it, so it does seem misleading to
call it stochastic gradient descent. Calling it gradient descent
(or GD) would be more accurate. Or, as you note, perhaps calling
it “gradient descent optimization stepper” would emphasize that
it only performs one step at a time. (Maybe folks thought that
leaving “stochastic” out of the name would confuse people who
just wanted to use the “standard” SGD optimization algorithm.)

Best.

K. Frank

ghlai9665 · November 28, 2020, 4:00pm

Yeah - newcomer to PyTorch here and I find the SGD name really confusing too. I understand SGD as gradient descent with a batch size of 1, but in reality the batch size is determined by the user. So I agree that it would be much less confusing if it was named just GD because that’s what it is.

Aditya_Ranganath · June 18, 2021, 1:25am

Shouldn’t it just be called “batch” gradient descent?
(Apologies if I am wrong.) If we compute the update using
$\theta_{k+1} = \theta_{k} - \frac{\eta}{m} \sum_{i=1}^{m} \nabla_{\theta}\, \mathrm{lossfn}(\mathrm{output}_i, \mathrm{target}_i)$ (assuming no momentum)
where $m$ is the batch size, with $0 < m \le N$, are we not explicitly calculating the entire (averaged) gradient for that batch?

artificial-cerebrum · June 23, 2021, 6:46pm

In the AlexNet paper they say:

We trained our models using stochastic gradient descent with a batch size of 128 examples.

Chao_Kong · February 1, 2022, 5:49am

I had exactly the same question. I suppose it should be called GD, rather than SGD or BGD (batch gradient descent).

What the torch.optim.SGD optimizer really does is perform one step of the gradient descent algorithm each time optimizer.step() is called.

The stochastic or random part is done in the DataLoader. Once we enable shuffling in the DataLoader, the GD algorithm becomes an SGD algorithm.
The batch part is done both in the DataLoader and in the loss function. We assign a batch_size to the DataLoader, and we tell the loss function (taking the L1 loss as an example) whether we want the mean or the sum of the loss over the whole batch.

So the SGD optimizer has nothing to do with either the random part or the batch part. It is just a GD algorithm, or a GD stepper.
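The division of labour described above can be sketched in plain Python. Everything here is a hypothetical stand-in rather than the PyTorch API: `iterate_batches` plays the role of the data loader, `batch_grad` the role of the loss reduction, and `gd_step` the role of the optimizer step. The toy model `w * x` with squared error is also an assumption for illustration.

```python
import random

def iterate_batches(data, batch_size, shuffle, rng):
    """Stand-in for a data loader: yields batches, optionally in shuffled order."""
    order = list(range(len(data)))
    if shuffle:
        rng.shuffle(order)  # the only source of randomness in this sketch
    for start in range(0, len(order), batch_size):
        yield [data[i] for i in order[start:start + batch_size]]

def batch_grad(w, batch, reduction="mean"):
    """Gradient of the squared error (w*x - y)**2 over a batch, averaged or summed."""
    total = sum(2.0 * (w * x - y) * x for x, y in batch)
    return total / len(batch) if reduction == "mean" else total

def gd_step(w, grad, lr):
    """Deterministic update; it knows nothing about batches or shuffling."""
    return w - lr * grad

def train(data, epochs, batch_size, shuffle, seed, lr=0.01):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in iterate_batches(data, batch_size, shuffle, rng):
            w = gd_step(w, batch_grad(w, batch), lr)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy data: y = 2x

# Shuffled mini-batches: behaves like SGD
w_sgd = train(data, epochs=200, batch_size=2, shuffle=True, seed=0)
# One unshuffled full batch per epoch: behaves like ordinary GD
w_gd = train(data, epochs=200, batch_size=len(data), shuffle=False, seed=0)
```

The same `gd_step` is used in both runs. Only the arguments given to the loader stand-in decide whether the overall procedure is stochastic.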