I recently learned that SGD normally does not converge if the learning rate is fixed. I wondered how PyTorch handles the learning rate when optimizer = optim.SGD() is used as the optimizer, since it requires no parameters that seem to affect the learning rate. I wondered if momentum is…

How does PyTorch's SGD work?

tataganesh (Tata Ganesh) February 1, 2023, 9:18pm 4

The “stochastic” part of SGD comes from computing the gradient on mini-batches of the dataset, whereas plain (full-batch) Gradient Descent computes the gradient over the full dataset at every step.
This forum post might be helpful.
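
To make the mini-batch point concrete, here is a minimal sketch of a training loop. The toy data, the model, and all hyperparameter values are made up for illustration. As far as I know, optim.SGD takes the learning rate through its lr argument and does not change it on its own; if you want it to decay, one option is to attach a scheduler such as StepLR, which is what this sketch assumes.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy regression data (hypothetical): y = 3x + 1 plus a little noise.
X = torch.randn(256, 1)
y = 3.0 * X + 1.0 + 0.1 * torch.randn(256, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()

# The learning rate is passed explicitly; the optimizer keeps it fixed.
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Optional: multiply the learning rate by 0.1 every 10 epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# The "stochastic" part: every step uses a shuffled mini-batch of 32
# samples instead of all 256 samples at once.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(20):
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)  # loss on this mini-batch only
        loss.backward()                # gradient of the mini-batch loss
        optimizer.step()               # parameter update using the current lr
    scheduler.step()                   # update the lr once per epoch

final_lr = optimizer.param_groups[0]["lr"]
learned_weight = model.weight.item()
print(final_lr, learned_weight)
```

Setting batch_size=len(X) would turn the same loop into full-batch gradient descent, and dropping the scheduler lines leaves the learning rate at the value passed to optim.SGD for the whole run.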