Okay, now I understand your doubt.

See, in the first case you are predicting, calculating the loss, calculating gradients, and applying a gradient update for each batch, so the total number of these operations per epoch equals `number_of_batches`, which can be calculated as `math.ceil(len(X) / batch_size)`. (Note that `(len(X) // batch_size) + 1` overcounts by one whenever `len(X)` is exactly divisible by `batch_size`.)
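As a quick sanity check, here is the batch count computed with ceiling division (a minimal sketch; `number_of_batches` is just an illustrative helper name):

```python
import math

def number_of_batches(n_samples: int, batch_size: int) -> int:
    # Ceiling division: a final, smaller batch still counts as a batch.
    return math.ceil(n_samples / batch_size)

print(number_of_batches(100, 32))  # 4 batches: 32 + 32 + 32 + 4
print(number_of_batches(96, 32))   # 3 batches: 96 is exactly divisible
```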

Whilst in the second case you apply these operations (predicting, calculating the loss, calculating gradients, and applying a gradient update) exactly once per epoch.

This is the first reason why the second one is faster.

Another reason, which is more relevant here, is parallel computation. Computations performed in parallel can be much faster than sequential ones, especially on GPUs (that's why we use GPUs in the first place).

And in the first case you are training by splitting the dataset into batches, so the batches are processed sequentially, which makes training slower.

Whilst in the second case you are training on the whole `X`, `Y` at once, so the computation can run in parallel, which makes training faster.
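A minimal sketch of the two cases, using a toy 1-D linear model in plain Python (function names like `train_minibatch` are illustrative, not from your code):

```python
def grad_step(w, xs, ys, lr=0.01):
    # One gradient update of w for the model y = w * x with MSE loss.
    n = len(xs)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - lr * grad

def train_minibatch(w, X, Y, batch_size):
    # Case 1: one update per batch -> number_of_batches updates per epoch.
    updates = 0
    for i in range(0, len(X), batch_size):
        w = grad_step(w, X[i:i + batch_size], Y[i:i + batch_size])
        updates += 1
    return w, updates

def train_fullbatch(w, X, Y):
    # Case 2: exactly one update per epoch, over the whole X, Y.
    return grad_step(w, X, Y), 1

X = list(range(1, 11))
Y = [3 * x for x in X]
_, updates_1 = train_minibatch(0.0, X, Y, batch_size=4)
_, updates_2 = train_fullbatch(0.0, X, Y)
print(updates_1, updates_2)  # 3 updates vs 1 update per epoch
```

The per-epoch update count is the part that differs; in a real framework the full-batch step is additionally vectorized across all samples at once, which is where the GPU parallelism comes in.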

These are the reasons why the second case is faster.

But as you said, in the second case the loss wouldn't converge as easily as in the first. That's why we almost always split the dataset into batches and then train the model; there's no point in better throughput if the loss isn't converging.

So don't worry about the performance regression you get by splitting the dataset, 'cause you'd usually get better loss convergence in return.

Instead, apply the second case when validating.

See, in training we were splitting the dataset for better optimization. But while validating there's actually no point in splitting the validation data. The only case where you'd have to do so is when your validation data is so big that you can't fit it in your memory. Only in that case do you split your validation data and use a dataloader in order to validate your model.
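If your validation set really is too big for memory, a plain batching generator is all a dataloader does conceptually (a framework-agnostic sketch; in PyTorch you'd reach for `torch.utils.data.DataLoader` instead):

```python
def batches(data, batch_size):
    # Yield successive fixed-size chunks; the last one may be smaller.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

val_X = list(range(10))
chunk_sizes = [len(chunk) for chunk in batches(val_X, 4)]
print(chunk_sizes)  # [4, 4, 2]
```

You'd run your model on each chunk, accumulate the loss or metric, and average at the end; no gradient updates happen here, so the batching is purely a memory concern.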