I have a question about PyTorch 0.3.0.
(I) I use the following code:
total_loss = net([batch_size, input_size])  # input holds all batch_size training samples
total_loss.backward()
optimizer.step()
(II) I use the following code:
total_loss = 0
for i in range(batch_size):
    loss = net([(i)input_size])  # input is the i-th training sample
    total_loss += loss / batch_size
total_loss.backward()
optimizer.step()

Can you tell me whether there is any difference between (I) and (II)? In (II) I use a “for loop” to feed the inputs to the neural network one sample at a time. Will the final results be different, or will only the training time be longer?

Usually you use a criterion to calculate the loss. By default the criterion averages the loss over the batch, so you don’t have to scale it yourself.
In your example the model returns the loss directly, which is also valid, but I’m not sure whether you need to scale it.
You should definitely check that it’s right.
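To make the averaging behaviour concrete, here is a minimal sketch. It assumes a recent PyTorch release, where the criterion takes a `reduction` argument (older releases such as 0.3.0 used a `size_average` flag instead), and the tensors are made-up stand-ins for model outputs and targets. Note that `nn.MSELoss` averages over all elements, which equals the batch size here only because each sample has a single output value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-ins for model predictions and targets (batch of 4)
output = torch.randn(4, 1)
target = torch.randn(4, 1)

mean_loss = nn.MSELoss()(output, target)                # default: reduction="mean"
sum_loss = nn.MSELoss(reduction="sum")(output, target)  # no averaging

# The summed loss divided by the number of elements matches the averaged loss
print(torch.allclose(mean_loss, sum_loss / output.numel()))
```

If your model returns a summed loss, dividing it by the number of samples would give the same scale as the default criterion.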

Back to your original question.
In your first example you are using batch gradient descent, i.e. all training samples are used for one weight update. Each iteration is thus one epoch.
This procedure might be OK, but it has been shown that mini-batch gradient descent generalizes better in a lot of use cases.

I’m not sure whether your second example uses only a single example or a few.
If it’s the former, you are applying stochastic gradient descent. You will have many more weight updates per epoch, since each sample gives you gradients for an update.
The latter case is called mini-batch gradient descent and is usually the way to go.
You use a specific batch size and update your model using the loss of this batch.
Some layers like BatchNorm need a certain number of samples to accurately estimate the running statistics, i.e. mean and variance. It’s usually approx. 64 samples, but might differ based on your problem.
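As an illustration of mini-batch gradient descent, here is a minimal sketch of one epoch. The toy data, the learning rate, and the batch size of 64 are assumptions chosen only for the example, not values taken from your setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical toy data: 256 samples, 10 features, 1 regression target
x = torch.randn(256, 10)
y = torch.randn(256, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_updates = 0
for x_batch, y_batch in loader:   # one pass over the loader is one epoch
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()              # one weight update per mini-batch
    num_updates += 1

print(num_updates)  # 256 samples / 64 per batch = 4 updates in this epoch
```

With a batch size of 1 the same loop would perform 256 updates per epoch (stochastic gradient descent), and with a batch size of 256 it would perform a single update (batch gradient descent).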

The final results of the training will most likely differ! The batch size is one hyperparameter which can change your model’s performance.

Thanks a lot.
What I mean is that both versions are mini-batch gradient descent. My first code uses batch_size examples, and I use the criterion to calculate the average loss.
My second code also uses batch_size examples. The difference is that I calculate the loss of each example in the batch separately (loss_1, loss_2, ..., loss_batch_size), then add them up and average them. After that I call backward and use the optimizer to update.
If I do not use batch normalization or anything else that may influence the loss, will the result be the same?

The “for loop” code also calculates the average loss of batch_size examples, and the gradient depends on the loss. So the loss in code 1 and code 2 is the same, and therefore the gradient is the same? That is my guess.
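The guess can be written down as a short derivation. This is a sketch under the assumption that the per-sample losses $\ell_i$ depend only on the shared parameters $\theta$ and on their own sample, i.e. no layer such as BatchNorm couples the samples of a batch; the symbols $\ell_i$, $\theta$ and $N$ (the batch size) are introduced here for illustration.

```latex
L_{\text{batch}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta)
\quad\Longrightarrow\quad
\nabla_\theta L_{\text{batch}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell_i(\theta)
```

Since differentiation is linear, averaging the per-sample losses in a loop and calling backward once yields the same gradient as the averaged loss of the whole batch, up to floating-point rounding.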

Do you use the same batch size to calculate the different losses?
If not, I’m not sure about the effects of this approach.

However, even without BatchNorm, your results will most likely depend on the batch size.
Maybe the results will be approx. the same, but the loss curve will look different.

Your code is not formatted properly, but it seems you are trying to do the following:

import torch
import torch.nn as nn

criterion = nn.MSELoss()
model = nn.Linear(10, 1)
x = torch.randn(10, 10)
y = torch.randn(10, 1)

# Use whole batch
output1 = model(x)
loss1 = criterion(output1, y)
loss1.backward()
print(model.weight.grad)
model.zero_grad()

# Use mini-batches (here: one sample at a time)
loss2 = 0
for x_, y_ in zip(x, y):
    # unsqueeze is not in-place, so assign the result to add the batch dimension
    x_ = x_.unsqueeze(0)
    y_ = y_.unsqueeze(0)
    output2 = model(x_)
    loss2 += criterion(output2, y_)
# Scale the loss, since we accumulated it
loss2 /= len(x)
loss2.backward()
print(model.weight.grad)

Yes, the gradient should be the same.
In my example I used a single example per iteration of the for loop, so the scaling was straightforward.
If you use a batch size > 1 with an averaged loss, you would have to divide by the number of batches instead.
Your loss might differ if your batch size is not constant, for example because the last batch is smaller.
Using a DataLoader, you could just drop the last batch if it’s smaller by setting drop_last=True.
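A minimal sketch of the batch size > 1 case, assuming made-up toy data of 10 samples and a batch size of 4. With drop_last=True the loader keeps two full batches and discards the last two samples, so the reference gradient is computed on the first 8 samples only (shuffle=False keeps the order fixed).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
x = torch.randn(10, 10)   # hypothetical toy data: 10 samples
y = torch.randn(10, 1)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # averages within each batch

# batch_size=4 on 10 samples: drop_last=True discards the final batch of 2
loader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=False, drop_last=True)

loss = 0
for x_batch, y_batch in loader:
    loss = loss + criterion(model(x_batch), y_batch)
loss = loss / len(loader)   # divide by the number of batches (2 here)
loss.backward()
grad_minibatch = model.weight.grad.clone()

# Reference: one averaged loss over the same 8 samples the loader kept
model.zero_grad()
criterion(model(x[:8]), y[:8]).backward()
grad_full = model.weight.grad.clone()

print(len(loader))  # number of batches that were used
print(torch.allclose(grad_minibatch, grad_full, atol=1e-6))
```

Without drop_last=True the last batch would hold only 2 samples, and dividing the accumulated loss by the number of batches would weight those 2 samples more heavily than the others.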

I thought we cannot do an in-place loss update like loss2 += criterion(output2, y_). Or am I wrong, and I can do it in a loop without any issue (I am not considering speed efficiency)?

Also, just out of curiosity, why is the second case that you wrote different from this?

# Use mini-batches
loss2 = 0
for x_, y_ in zip(x, y):
    x_ = x_.unsqueeze(0)
    y_ = y_.unsqueeze(0)
    output2 = model(x_)
    loss2 = loss2 + criterion(output2, y_)
# Scale the loss, since we accumulated it
loss2 /= len(x)
loss2.backward(retain_graph=True)
print(model.weight.grad)
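One way to check this is to compare the two accumulation styles directly. The sketch below assumes the same toy setup as the earlier example; the helper name `accumulated_grad` is made up for illustration. In-place operations can raise an error when they overwrite a tensor that autograd still needs for the backward pass, which does not appear to be the case for this kind of loss accumulation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(10, 10)   # hypothetical toy data
y = torch.randn(10, 1)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

def accumulated_grad(in_place):
    """Accumulate per-sample losses, average them, and return the weight gradient."""
    model.zero_grad()  # avoid mixing in gradients from an earlier backward call
    loss = 0
    for x_, y_ in zip(x, y):
        sample_loss = criterion(model(x_.unsqueeze(0)), y_.unsqueeze(0))
        if in_place:
            loss += sample_loss        # in-place on the tensor after the first iteration
        else:
            loss = loss + sample_loss  # builds a new tensor every iteration
    loss = loss / len(x)
    loss.backward()
    return model.weight.grad.clone()

grad_in_place = accumulated_grad(in_place=True)
grad_out_of_place = accumulated_grad(in_place=False)
print(torch.allclose(grad_in_place, grad_out_of_place))
```

Note that retain_graph=True is not required when backward is called only once on the graph. If the two versions print different gradients in a longer script, one thing worth checking is whether model.zero_grad() was called in between, since gradients accumulate in .grad across backward calls.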