How SGD works in PyTorch

I am taking Andrew Ng’s deep learning course. He said stochastic gradient descent means that we update the weights after computing the gradient on every single sample. But when I looked at examples of mini-batch training in PyTorch, I found that they update the weights once per mini-batch while still using the SGD optimizer. I am confused by the concept.

You are right. In practice, the SGD optimizer in PyTorch is used as mini-batch gradient descent, optionally with momentum.

Thanks a lot! So is there a difference between SGD and ASGD?

In PyTorch, the SGD optimizer has several options. Setting the momentum parameter to 0 (the default) gives you standard SGD. If momentum > 0, you get momentum without the lookahead, i.e., classical momentum. nesterov is a bool which, if set to True, adds the lookahead; this is known as Nesterov’s Accelerated Gradient.
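To make the three settings concrete, here is a minimal sketch that runs two updates on a toy one-parameter problem with each configuration. The helper run_steps, the loss w², and the learning rate 0.1 are made up for illustration; only torch.optim.SGD and its momentum and nesterov arguments come from PyTorch.

```python
import torch


def run_steps(optimizer_kwargs, steps=2):
    """Run a few SGD updates on the toy loss w**2 and return the final w.

    run_steps is a hypothetical helper for this example, not a PyTorch API.
    """
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = torch.optim.SGD([w], lr=0.1, **optimizer_kwargs)
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()  # gradient of the loss is 2 * w
        loss.backward()
        opt.step()
    return w.item()


plain = run_steps({"momentum": 0.0})                       # standard SGD
classical = run_steps({"momentum": 0.9})                   # classical momentum
nesterov = run_steps({"momentum": 0.9, "nesterov": True})  # Nesterov lookahead

print(plain, classical, nesterov)
```

On this toy problem the three runs end at different values of w after the same number of steps, which shows that the flags change the update rule itself and have nothing to do with how many samples went into the gradient.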

You may want to look up the implementation here.


Thanks! Maybe I should look up the implementation for details.

So what is the mini-batch size in the PyTorch SGD optimizer?

What do you mean by the mini-batch size?

That is a hyperparameter that you have to choose. As a general rule of thumb, you do not want it to be very small (because then you are not exploiting vectorized code, so training is quite slow), and you definitely do not want it to be very large (recent studies suggest that nets trained with large batches tend to reach sharp minima, which do not generalize as well). Typically, for CNNs you see mini-batch sizes that are powers of two, from 16 (usually for large nets) up to 128 or 256 (for smaller nets). For other architectures like FCNs or R-CNNs, people might use purely stochastic updates (i.e., batch size = 1).

Implementation details: you define the size of the mini-batch in the data loader, not in the optimizer. Something like:

train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=12, pin_memory=True)

where train is your dataset, batch_size is your batch size (an integer), and shuffle controls whether the data is shuffled (True in training, False in inference).
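As a rough, self-contained sketch of how this fits into a training loop: the synthetic dataset, the linear model, the learning rate, and the batch size of 32 below are all assumptions made up for illustration. Note that the optimizer never receives batch_size; it simply steps once per batch that the loader yields.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy dataset: 100 samples, 3 features, 1 regression target.
features = torch.randn(100, 3)
targets = torch.randn(100, 1)
train = TensorDataset(features, targets)

batch_size = 32  # the mini-batch size is set here, not in the optimizer
train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

num_updates = 0
for x, y in train_loader:        # x has at most batch_size rows
    optimizer.zero_grad()        # clear gradients from the previous batch
    loss = loss_fn(model(x), y)  # loss averaged over this mini-batch
    loss.backward()              # fills .grad on the model parameters
    optimizer.step()             # one weight update per mini-batch
    num_updates += 1

print(num_updates)  # 100 samples in batches of 32 -> 4 updates (32, 32, 32, 4)
```

Setting batch_size=1 in the loader would turn the same loop into the per-sample SGD from the course, and batch_size=100 would make it full-batch gradient descent on this toy dataset.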


So it is a hyperparameter. I thought it was a default setting in the PyTorch SGD optimizer, but according to @Ismail_Elezi’s reply, I was wrong.

PyTorch’s optimizers usually only look at the .grad attribute of the parameters, which means that all they define is a rule for updating parameters given the gradients. They do not care whether the gradient comes from a batch, a single sample, or was even set manually.
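A small sketch of the "manually set" case, with made-up numbers: there is no data, no loss, and no backward() call here, yet the optimizer still applies its update rule to whatever is stored in .grad.

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optimizer = torch.optim.SGD([w], lr=0.5)

# Write the gradient by hand instead of calling backward().
w.grad = torch.tensor([2.0, -4.0])

# Plain SGD update: w <- w - lr * grad
optimizer.step()

print(w.data)  # [1 - 0.5 * 2, 2 - 0.5 * (-4)] = [0., 4.]
```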
