Suppose this is our data:
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], requires_grad=True)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
X, y
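(FFN isn’t shown above; for concreteness, here’s a minimal sketch of the kind of 2-input/1-output network I’m using for this XOR data. Treat it as a hypothetical definition, the exact architecture doesn’t matter much for the question:)

import torch
import torch.nn as nn
import torch.optim as optim

class FFN(nn.Module):
    # Hypothetical definition, sized for the XOR data above.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 4),  # 2 input features -> 4 hidden units
            nn.Tanh(),
            nn.Linear(4, 1),  # 4 hidden units -> 1 output
            nn.Sigmoid(),     # squash to (0, 1) to match the 0/1 labels
        )

    def forward(self, x):
        return self.net(x)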
And we can employ GD with:
model = FFN()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
for _ in range(1000):
    output = model(X)
    loss = loss_fn(output, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
PyTorch abstracts things, but basically it lets me pass in multiple inputs and compute multiple outputs ‘at the same time’ somehow. FFN expects an input with 2 features, but I’m giving it a batch of 4 such inputs, i.e. one tensor of shape (4, 2).
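For instance (the shapes here assume the FFN sketch above, with a single output unit):

out = model(X)            # X has shape (4, 2): a batch of 4 samples, 2 features each
print(out.shape)          # torch.Size([4, 1]): one prediction per sample
print(model(X[0]).shape)  # torch.Size([1]): a single unbatched sample also works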
Though, GD means we perform the update only after a forward pass over all the data, making one epoch equivalent to one step.
So in theory, this one should also be a correct GD:
model = FFN()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
for _ in range(1000):
    for inputs, labels in zip(X, y):
        output = model(inputs)
        loss = loss_fn(output, labels)
        loss.backward()  # gradients accumulate across the per-sample backward calls
    for param in model.parameters():
        if param.grad is not None:
            param.grad /= len(X)  # turn the summed gradients into an average
    optimizer.step()
    optimizer.zero_grad()
Note, I’m also doing param.grad /= len(X) because, unless I’m mistaken, GD is supposed to use the average gradient of the loss rather than the summed per-sample gradients. I’m not sure if PyTorch already does this internally (e.g. via MSELoss’s default mean reduction) when the whole batch is passed at once.
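(A quick check I could run to convince myself, assuming the FFN sketch above: compute the full-batch gradient and the averaged per-sample gradients on the same fresh model and compare them.)

check = FFN()
loss_fn = torch.nn.MSELoss()  # default reduction='mean': averages over the batch

# Full-batch gradient
check.zero_grad()
loss_fn(check(X), y).backward()
batch_grads = [p.grad.clone() for p in check.parameters()]

# Summed per-sample gradients, divided by N
check.zero_grad()
for inputs, labels in zip(X, y):
    loss_fn(check(inputs), labels).backward()
for p, g in zip(check.parameters(), batch_grads):
    print(torch.allclose(p.grad / len(X), g))  # expect True for every parameter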
Are both approaches valid? I’m aware that the second one will not scale well, but I want to confirm whether, theoretically, both are correct ways to employ Gradient Descent.