Ahh, sorry, it’s just a mismatch in terminology. The SGD optimizer is vanilla gradient descent: with default arguments, literally all it does is subtract the learning rate times the gradient from the weight, as expected. See here: How SGD works in pytorch
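To make that concrete, here’s a minimal sketch (the toy loss is purely for illustration) showing that with `torch.optim.SGD`’s defaults (no momentum, no weight decay) the update is nothing more than `w -= lr * w.grad`:

```python
import torch

lr = 0.1
w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()   # toy loss, just for illustration
loss.backward()         # w.grad is now 2 * w = [2.0, 4.0]

with torch.no_grad():
    w -= lr * w.grad    # the entire "SGD" step: w <- w - lr * grad
w.grad.zero_()

print(w)  # tensor([0.8000, 1.6000], requires_grad=True)
```

Constructing `torch.optim.SGD([w], lr=0.1)` and calling `.step()` after `backward()` would produce the same numbers, since the defaults add no momentum or regularization on top of this update.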