I’ve learned that stochastic gradient descent (SGD) updates the weights one sample after another in random order. Then I saw the following code example. My understanding was that, within one iteration, this code extracts a batch of data (say 10 samples if the batch size is set to 10), and then SGD updates the weights one sample after another within these 10 samples in random order.
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
for batch_id, (inputs, labels) in enumerate(train_dataloader):
    y_hat = model(inputs)                     # forward pass on the whole batch
    loss = F.cross_entropy(y_hat, labels)     # mean loss over the batch
    model.zero_grad()                         # clear gradients from the previous step
    loss.backward()
    optimizer.step()                          # weight update
However, I see from this post that @chenyuntc says that “SGD optimizer in PyTorch actually is Mini-batch Gradient Descent with momentum”. On that reading, SGD does not update the weights one sample after another in random order, but instead performs a single update by processing the batch of 10 data points together, so there is essentially no per-sample stochasticity.
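For reference, what I originally pictured SGD doing is something like the sketch below (reusing the same model, optimizer, and train_dataloader from above; the inner per-sample loop with torch.randperm is just my own illustration of “one sample after another in random order”, not something I know PyTorch to do internally):

for batch_id, (inputs, labels) in enumerate(train_dataloader):
    for i in torch.randperm(inputs.size(0)):            # visit samples in random order
        y_hat = model(inputs[i].unsqueeze(0))            # forward pass on a single sample
        loss = F.cross_entropy(y_hat, labels[i].unsqueeze(0))
        model.zero_grad()
        loss.backward()
        optimizer.step()                                 # one weight update per sample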
My question is which one is true? Thanks!