Are there two valid Gradient Descent approaches in PyTorch?

Suppose this is our data:

import torch
from torch import optim

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], requires_grad=True)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
X, y

And we can employ GD with:

model = FFN()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for _ in range(1000):
    output = model(X)          # forward pass over all 4 samples at once
    loss = loss_fn(output, y)  # one scalar loss for the whole batch
    loss.backward()
    optimizer.step()           # one parameter update per epoch
    optimizer.zero_grad()

PyTorch abstracts things, but basically it lets me pass in multiple inputs and computes multiple outputs ‘at the same time’ somehow. FFN expects a single input with 2 features, but I’m giving it four such inputs stacked into a (4, 2) batch.
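For context, FFN is just a small feed-forward net; the exact layers and sizes below are placeholders, but they show the shape behaviour:

import torch
from torch import nn

class FFN(nn.Module):
    # Placeholder architecture; any feed-forward net behaves the same way
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 8),  # Linear acts on the last dimension only
            nn.Sigmoid(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = FFN()
print(model(X).shape)  # torch.Size([4, 1]): one output row per input row

Every layer broadcasts over the leading batch dimension, so the four samples really are processed in one call.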

Though, GD means we perform the update only after the forward pass over all the data, making one epoch equivalent to one step.

So, in theory, this should also be a correct GD:

model = FFN()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for _ in range(1000):
    for inputs, labels in zip(X, y):  # one sample at a time
        output = model(inputs)
        loss = loss_fn(output, labels)
        loss.backward()               # .grad accumulates across samples

    for param in model.parameters():
        if param.grad is not None:
            param.grad /= len(X)      # turn the summed grads into an average
    optimizer.step()                  # still one update per epoch
    optimizer.zero_grad()

Note, I’m also doing /= len(X) because, unless I’m mistaken, for GD you’re supposed to take the average gradient of the loss, not the summed gradients. I’m not sure if PyTorch does this internally as well.
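One way to check what the batched version does: torch.nn.MSELoss defaults to reduction='mean', so the loss there is already averaged over all elements:

import torch

out = torch.tensor([[0.2], [0.9], [0.4], [0.1]])
tgt = torch.tensor([[0.], [1.], [1.], [0.]])

# Default MSELoss uses reduction='mean': sum of squared errors / num elements
print(torch.nn.MSELoss()(out, tgt))
print(((out - tgt) ** 2).mean())  # same value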

Are both approaches valid? I’m aware that the second one will not scale well but I want to confirm if theoretically, both approaches are correct ways to employ Gradient Descent.

Yes, they’re both the same (up to numerical precision).
They will have different runtime/memory tradeoffs though.
See details here: Why do we need to set the gradients manually to zero in pytorch? - #20 by albanD


It is true that I have to divide the gradients by the number of samples in the data in the second case, right? Otherwise it wouldn’t be equivalent.

I tested it out with MSELoss and it seems to hold true.
The first basically says: Four Inputs, Four Outputs, One Loss. The loss function I used was MSELoss, so it computes a ‘total loss’ for all four inputs and outputs by squaring all 4 differences, adding them up and dividing by 4. The gradient is computed over this single averaged loss.

In the second case, I compute the loss individually: Four Inputs, Four Outputs, Four Losses. I compute four different gradients and they accumulate with each iteration. For the update to be equivalent to the previous one, I have to divide by the number of samples, otherwise the grad values would be four times too big. This works out because differentiation is linear: the gradient of the average loss is the average of the per-sample gradients.
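A minimal self-contained sketch of that check (using a plain nn.Linear in place of FFN, since the equivalence doesn’t depend on the architecture):

import torch

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])
loss_fn = torch.nn.MSELoss()

m1 = torch.nn.Linear(2, 1)
m2 = torch.nn.Linear(2, 1)
m2.load_state_dict(m1.state_dict())  # identical weights in both copies

loss_fn(m1(X), y).backward()         # batched: one averaged loss

for inputs, labels in zip(X, y):     # per-sample: gradients accumulate
    loss_fn(m2(inputs), labels).backward()
for p in m2.parameters():
    p.grad /= len(X)                 # sum -> average

for p1, p2 in zip(m1.parameters(), m2.parameters()):
    print(torch.allclose(p1.grad, p2.grad))  # True for weight and bias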