To work around a memory problem I split my training data into batches and process them sequentially instead of feeding the whole dataset into the network at once.
But I get different gradients between the first and second examples given below.
I think they are mathematically equivalent and should give the same result; correct me if I'm wrong.
1.
for epoch in range(epochs):
    optimizer.zero_grad()
    for b_data in splitted_data:
        ...
        out = model(b_data)
        err = loss_fn(out, gt) / number_of_batches
        err.backward()
    optimizer.step()
2.
for epoch in range(epochs):
    optimizer.zero_grad()
    out = model(all_data)
    err = loss_fn(out, target)
    err.backward()
    optimizer.step()
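For reference, the two variants can be compared directly by accumulating gradients without stepping and checking them against a single full-batch backward pass. This is a minimal sketch, assuming a mean-reduced loss (e.g. `nn.MSELoss`), equal-sized batches, and no batch-dependent layers such as BatchNorm; the model, data shapes, and the `splitted_target` name are illustrative stand-ins, not from the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: a small linear model, MSE loss (mean reduction),
# and data split into equal-sized batches.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
all_data = torch.randn(8, 4)
target = torch.randn(8, 1)

splitted_data = all_data.chunk(2)
splitted_target = target.chunk(2)  # per-batch ground truth
number_of_batches = len(splitted_data)

# Variant 1: accumulate gradients batch by batch, scaling each loss.
model.zero_grad()
for b_data, b_target in zip(splitted_data, splitted_target):
    out = model(b_data)
    err = loss_fn(out, b_target) / number_of_batches
    err.backward()  # gradients accumulate into p.grad
grads_accum = [p.grad.clone() for p in model.parameters()]

# Variant 2: one backward pass over the full dataset.
model.zero_grad()
out = model(all_data)
err = loss_fn(out, target)
err.backward()
grads_full = [p.grad.clone() for p in model.parameters()]

# With a mean-reduced loss and equal batch sizes, the accumulated
# gradients match the full-batch gradients up to floating-point error.
for g_a, g_f in zip(grads_accum, grads_full):
    print(torch.allclose(g_a, g_f, atol=1e-6))
```

Under these assumptions the check passes; differences in practice usually come from floating-point accumulation order, unequal batch sizes, or layers whose behavior depends on the batch (BatchNorm, some dropout setups).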