Graph attention networks normally do not support batched input, so I would like to know whether I can implement stochastic gradient descent by feeding one sample at a time, accumulating the loss, and finally dividing the loss by a batch_size that I define myself. Does this achieve the same result as feeding the data as a batch?
It should be like this:
### method1
```python
batch_size = 64
loss_batch = 0
for i in range(batch_size):
    output = model(data)   # data.shape == (224, 224, 3)
    loss = ...             # calculate the loss for this output
    loss_batch = loss_batch + loss
loss_batch = loss_batch / batch_size
loss_batch.backward()
```
### method2
```python
output = model(data_batch)   # data_batch.shape == (batch_size, 224, 224, 3)
loss_batch = ...             # calculate the loss for the batched output
loss_batch.backward()
```
Yes, though your method1 is suboptimal: it is better to call loss.backward() inside the loop and accumulate the gradients, because each backward call releases that iteration's backward graph instead of keeping all of them in memory until the end. That said, you should prefer method2 unless some of the operations you need are not implemented for batched input.
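A minimal sketch of the accumulation variant described above, assuming a generic PyTorch model, a loss criterion, an optimizer named opt, and a list of (data, target) pairs named samples (all placeholder names, not from the original post):

```python
import torch
from torch import nn

def train_virtual_batch(model: nn.Module,
                        criterion: nn.Module,
                        opt: torch.optim.Optimizer,
                        samples) -> None:
    """Accumulate gradients over `samples`, then apply a single optimizer step.

    Hypothetical helper: calling loss.backward() once per sample frees that
    sample's autograd graph, so memory stays flat no matter how many samples
    make up the virtual batch.
    """
    batch_size = len(samples)
    opt.zero_grad()
    for data, target in samples:
        output = model(data)                  # forward pass on one sample
        loss = criterion(output, target)      # per-sample loss
        (loss / batch_size).backward()        # scale, then free this sample's graph
    opt.step()                                # one update for the whole virtual batch
```

Dividing each per-sample loss by batch_size before backward keeps the accumulated gradient equal to the gradient of the mean loss, which is what method1 computes by dividing at the end.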
Thanks for your reply.
I still have one question though.
If I accumulate the gradients with loss.backward() n times and then update the parameters with a single opt.step(), is that equivalent to n rounds of loss.backward() and opt.step() with batch_size = 1, or to one loss.backward() and opt.step() with batch_size = n?
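For reference, the two update schedules being compared might look like this (a sketch with placeholder names model, criterion, opt, and samples, not from the original thread):

```python
import torch
from torch import nn

def accumulate_then_step(model: nn.Module, criterion: nn.Module,
                         opt: torch.optim.Optimizer, samples) -> None:
    """n calls to loss.backward(), then a single opt.step()."""
    opt.zero_grad()
    for data, target in samples:              # n individual samples
        loss = criterion(model(data), target)
        (loss / len(samples)).backward()      # gradients add up across samples
    opt.step()                                # one update based on all n samples

def step_every_sample(model: nn.Module, criterion: nn.Module,
                      opt: torch.optim.Optimizer, samples) -> None:
    """n separate backward/step pairs, i.e. batch_size = 1 updates."""
    for data, target in samples:
        opt.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        opt.step()                            # parameters change before the next sample
```

The first variant applies one update computed from all n samples, while the second updates the parameters between samples.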