To work around a memory problem I split my training data into batches and process them sequentially instead of feeding the whole dataset into the network at once.
But I get different gradients between the first and second examples given below.
I think they are mathematically equivalent and should give the same result; correct me if I'm wrong.
1.
for epoch in range(epochs):
    optimizer.zero_grad()
    for b_data in splitted_data:
        ...
        out = model(b_data)
        err = loss_fn(out, gt) / number_of_batches
        err.backward()
    optimizer.step()
2.
for epoch in range(epochs):
    optimizer.zero_grad()
    out = model(all_data)
    err = loss_fn(out, target)
    err.backward()
    optimizer.step()
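For reference, the two variants can be compared directly by accumulating gradients without stepping and checking them against a single full-batch backward pass. This is a minimal sketch, assuming a mean-reduced loss (e.g. `nn.MSELoss`), equal-sized batches, and no batch-dependent layers such as BatchNorm; the model, data shapes, and the `splitted_target` name are illustrative stand-ins, not from the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: a small linear model, MSE loss (mean reduction),
# and data split into equal-sized batches.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
all_data = torch.randn(8, 4)
target = torch.randn(8, 1)

splitted_data = all_data.chunk(2)
splitted_target = target.chunk(2)  # per-batch ground truth
number_of_batches = len(splitted_data)

# Variant 1: accumulate gradients batch by batch, scaling each loss.
model.zero_grad()
for b_data, b_target in zip(splitted_data, splitted_target):
    out = model(b_data)
    err = loss_fn(out, b_target) / number_of_batches
    err.backward()  # gradients accumulate into p.grad
grads_accum = [p.grad.clone() for p in model.parameters()]

# Variant 2: one backward pass over the full dataset.
model.zero_grad()
out = model(all_data)
err = loss_fn(out, target)
err.backward()
grads_full = [p.grad.clone() for p in model.parameters()]

# With a mean-reduced loss and equal batch sizes, the accumulated
# gradients match the full-batch gradients up to floating-point error.
for g_a, g_f in zip(grads_accum, grads_full):
    print(torch.allclose(g_a, g_f, atol=1e-6))
```

Under these assumptions the check passes; differences in practice usually come from floating-point accumulation order, unequal batch sizes, or layers whose behavior depends on the batch (BatchNorm, some dropout setups).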