I’ve learned that stochastic gradient descent (SGD) updates the weights one sample after another in random order. Then I saw the following code example. My understanding was that, within one iteration, this code extracts a batch of data (say 10 samples if the batch size is set to 10), and then SGD updates the weights one sample after another within these 10 samples in random order.
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
for batch_id, (inputs, labels) in enumerate(train_dataloader):
    y_hat = model(inputs)                     # forward pass on the whole batch
    loss = F.cross_entropy(y_hat, labels)     # mean loss over the batch
    model.zero_grad()                         # clear gradients from the previous step
    loss.backward()
    optimizer.step()                          # weight update
However, I see from this post that @chenyuntc says that “SGD optimizer in PyTorch actually is Mini-batch Gradient Descent with momentum”. On that reading, SGD does not update the weights one sample after another in random order, but instead performs a single update by processing the batch of 10 data points together, so there is essentially no per-sample stochasticity.
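For reference, what I originally pictured SGD doing is something like the sketch below (reusing the same model, optimizer, and train_dataloader from above; the inner per-sample loop with torch.randperm is just my own illustration of “one sample after another in random order”, not something I know PyTorch to do internally):

for batch_id, (inputs, labels) in enumerate(train_dataloader):
    for i in torch.randperm(inputs.size(0)):            # visit samples in random order
        y_hat = model(inputs[i].unsqueeze(0))            # forward pass on a single sample
        loss = F.cross_entropy(y_hat, labels[i].unsqueeze(0))
        model.zero_grad()
        loss.backward()
        optimizer.step()                                 # one weight update per sample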
My question is which one is true? Thanks!