SGD naming mismatch?

Hi all. I am sorry to bug you again with this kind of question, but the naming is really confusing (if not wrong, in my opinion). The question is about torch.optim.SGD.

Question in short: Isn’t torch’s SGD just a regular ‘step in the direction of the negative gradient’ optimizer rather than SGD? Shouldn’t it be named ‘GD’ instead of ‘SGD’?

(lengthy) explanation:

Usually we do gradient descent (ignoring all the overfitting issues for a moment) like this: we are given a data set x_i (i=1,…,N) with true answers y_i (i=1,…,N), some function f (that we have maybe implemented in PyTorch) that depends on the input x and some model parameters theta, and some loss function l. We abstractly compute the gradient g = g(x, y, theta) of l(y, f(x, theta)) w.r.t. theta, initialize the parameters to some value theta_0 and then put
theta_(t+1) = theta_t - lr * (1/N) * sum_i g(x_i, y_i, theta_t)
(for some learning rate lr) and iterate. As we cannot do this with complicated gradients and large training sets, we use SGD with minibatches, which (according to my understanding) is the following:

iterate:
select a random subset i_1, …, i_M of fixed size M from the training set
theta_(t+1) = theta_t - lr * (1/M) * sum_k g(x_i_k, y_i_k, theta_t)
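
For concreteness, here is a minimal sketch of that loop in PyTorch (the data, the parametrization of theta, the minibatch size M and the learning rate lr are all placeholder choices of mine; full-batch GD is just the special case M = N):

import torch

# toy data and a hand-rolled linear model, just to make the loop concrete
N, M, lr = 100, 8, 0.01                     # dataset size, minibatch size, learning rate
X = torch.randn(N, 2)
Y = torch.randn(N, 1)
theta = torch.zeros(3, requires_grad=True)  # theta = [w0, w1, b]

for t in range(1000):
    idx = torch.randperm(N)[:M]             # select a random subset i_1, ..., i_M
    yhat = X[idx] @ theta[:2].unsqueeze(1) + theta[2]
    loss = ((yhat - Y[idx]) ** 2).mean()    # (1/M) * sum_k of the per-sample losses
    loss.backward()                         # theta.grad now holds the mean gradient
    with torch.no_grad():
        theta -= lr * theta.grad            # theta_(t+1) = theta_t - lr * mean gradient
        theta.grad.zero_()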

so the SGD function should somehow be linked to the current minibatch… Let’s say very concretely that we consider linear regression with current weights = [3, 4] and bias = -1, the training batch is [[1, 2], [3, 3]], the desired outputs are [[15], [-1]], and we use MSE as the loss. Then theta = [w0, w1, b] and the gradient is
g(x, y, theta) = 2*(yhat - y) * [x0, x1, 1]
i.e. we get two gradients for the two training samples, namely [-10, -20, -10] and [126, 126, 42]. So I implemented this in torch in the following way:

import torch
import torch.nn as nn

linearFunction = nn.Linear(2, 1, True)
# overwrite the random initialization with the weights and bias from the example
weights = torch.tensor([[3, 4]], dtype=torch.float)
linearFunction.weight = nn.Parameter(weights, requires_grad=True)
bias = torch.tensor([-1], dtype=torch.float)
linearFunction.bias = nn.Parameter(bias, requires_grad=True)

# the two training samples and their desired outputs
x = torch.tensor([[1, 2], [3, 3]], dtype=torch.float)
y = torch.tensor([[15], [-1]], dtype=torch.float)
yhat = linearFunction(x)
lossFunction = nn.MSELoss()
loss = lossFunction(yhat, y)
loss.backward()

Now I want to see the gradients… However, linearFunction.weight.grad (which I expected to give me two gradients, one for the first sample and one for the second) gives me

tensor([[58., 53.]])

which is the mean of the two per-sample gradients… So what SGD effectively does is step in the direction of the negative mean gradient, while the tasks of selecting the minibatch and of forming the mean gradient are done by other parts of torch… so what exactly is the S in SGD?
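
(For reference, my understanding is that with the default options, i.e. no momentum and no weight decay, the step() of torch.optim.SGD boils down to roughly the following simplified sketch:)

import torch

def sgd_step(params, lr):
    # simplified equivalent of torch.optim.SGD(params, lr).step()
    # with the default options (no momentum, no weight decay, ...)
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad  # one step in the direction of -(mean gradient)

# continuing the snippet above, after loss.backward():
# sgd_step([linearFunction.weight, linearFunction.bias], lr=0.01)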

Is it even possible to make the tensors linearFunction.weight and linearFunction.bias hold two different gradients (one per sample)?

Regards,

Fabian Werner

Yes, the update steps for GD and SGD are the same, so from the optimizer’s point of view they are identical.
If you want the S to be meaningful, you can read it as Sub-Gradient Descent :smiley: But here again, if you had true gradients, the update would be the same.

It is not possible to save gradients for all the samples. Doing so would require a significant amount of memory and potentially prevent some optimizations in the implementation.
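
That said, if you just want to inspect the per-sample gradients, you can compute them yourself, e.g. with an unreduced loss and one autograd call per sample. A minimal sketch, reusing the example above (note this redoes part of the backward work for every sample, so it does not scale):

import torch
import torch.nn as nn

linearFunction = nn.Linear(2, 1, True)
linearFunction.weight = nn.Parameter(torch.tensor([[3., 4.]]))
linearFunction.bias = nn.Parameter(torch.tensor([-1.]))

x = torch.tensor([[1., 2.], [3., 3.]])
y = torch.tensor([[15.], [-1.]])

# reduction='none' keeps one loss value per sample instead of the mean
perSampleLoss = nn.MSELoss(reduction='none')(linearFunction(x), y)  # shape [2, 1]

for k in range(perSampleLoss.shape[0]):
    # retain_graph=True so the shared graph survives several backward passes
    wGrad, bGrad = torch.autograd.grad(
        perSampleLoss[k, 0],
        (linearFunction.weight, linearFunction.bias),
        retain_graph=True)
    print(wGrad, bGrad)  # [[-10., -20.]], [-10.] then [[126., 126.]], [42.]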

I see, I just wanted to make sure that I understood correctly… Thanks !!