How SGD works in PyTorch

I am taking Andrew Ng’s deep learning course. He said that stochastic gradient descent means we update the weights after computing the gradient for every single sample. But when I looked at examples of mini-batch training in PyTorch, I found that they update the weights after every mini-batch, yet they use the SGD optimizer. I am confused by the concept.


You are right. The SGD optimizer in PyTorch is actually mini-batch gradient descent with (optional) momentum.


Thanks a lot! So is there a difference between SGD and ASGD?

In PyTorch, the SGD optimizer covers several variants. Setting the momentum parameter to 0 gives you standard SGD. If momentum > 0, you get momentum without the lookahead, i.e., classical momentum. nesterov is a bool which, if set to True, adds the lookahead, known as Nesterov’s Accelerated Gradient.

You may want to look up the implementation here.
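To make that concrete, here is a rough sketch of the three variants (the model, learning rate, and momentum values are just placeholders):

import torch

# any small model; just for illustration
model = torch.nn.Linear(10, 1)

# momentum = 0 (the default): plain SGD
plain_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# momentum > 0, nesterov=False: classical momentum
momentum_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# nesterov=True (requires momentum > 0): Nesterov's Accelerated Gradient
nesterov_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)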


Thanks! Maybe I should look up the implementation for details.

So what’s the mini-batch size in the PyTorch SGD optimizer?

What do you mean by mini-batch size?

That is a hyperparameter that you have to decide. As a general rule of thumb, you do not want it to be very small (you would not be exploiting vectorized code, so training goes quite slowly), and you definitely do not want it to be very large (recent studies show that training nets with large batches tends to reach sharp minima which do not generalize as well). Typically, for CNNs you see mini-batch sizes that are powers of two, between 16 (usually for large nets) and 128 or 256 (for smaller nets). For other architectures like FCNs or R-CNNs, people might use purely stochastic mini-batches (i.e., batch_size = 1).

Implementation details: you define the size of the mini-batch in the data loader, not in the optimizer. Something like:

train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=12, pin_memory=True)

where train is your dataset, batch_size is your batch size (an integer), and shuffle specifies whether to shuffle the data (True for training, False for inference).
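For completeness, here is a minimal sketch of how the DataLoader and the SGD optimizer fit together in a training loop; the model, loss, toy dataset, and hyperparameter values are just placeholders:

import torch

# placeholders: any model and loss will do
model = torch.nn.Linear(20, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# toy dataset standing in for `train` above
train = torch.utils.data.TensorDataset(torch.randn(1000, 20),
                                       torch.randint(0, 2, (1000,)))
train_loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)

for epoch in range(5):
    for inputs, targets in train_loader:   # one randomly shuffled mini-batch at a time
        optimizer.zero_grad()               # clear gradients from the previous step
        loss = criterion(model(inputs), targets)
        loss.backward()                     # gradients computed on this mini-batch only
        optimizer.step()                    # one parameter update per mini-batch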


A hyperparameter? I thought it was a default setting in the PyTorch SGD optimizer, but according to @Ismail_Elezi’s reply, I was wrong.

PyTorch’s optimizers usually only look at the .grad attribute of the parameters, which means that all they define is a rule to update the parameters given their gradients. They do not care whether the gradient comes from a batch, a single sample, or was even set manually.
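A contrived sketch to illustrate the point: you can set .grad by hand and step() will use it, with no batch or backward() involved at all.

import torch

w = torch.nn.Parameter(torch.zeros(3))
optimizer = torch.optim.SGD([w], lr=0.1)

w.grad = torch.ones_like(w)   # gradient set manually, no loss.backward() at all
optimizer.step()              # w is updated using whatever is in w.grad

print(w)                      # roughly tensor([-0.1000, -0.1000, -0.1000], ...)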


Thanks. But if this is true, does it mean there is nothing “stochastic” about the SGD optimizer in PyTorch?

This is how I understand it: PyTorch does standard gradient descent (with optional enhancements like momentum) using the gradient information provided by the user, so the stochastic part is the user’s responsibility. That is, the user can achieve SGD by randomly sampling mini-batches from the data and computing gradients on those rather than on all the data at once. This is easy to do with DataLoaders; see the sketch below.
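And if you want the textbook single-sample SGD from the course, as far as I can tell you just set the batch size to 1 in the loader (train being the dataset from the earlier snippet):

# "classic" SGD in the Andrew Ng sense: one randomly drawn sample per update
train_loader = torch.utils.data.DataLoader(train, batch_size=1, shuffle=True)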


It is annoying that PyTorch calls its optimizer Stochastic Gradient Descent when it truly isn’t. There is nothing stochastic about it; it’s just standard gradient descent with optional momentum. Why don’t they just stick with calling it GD?