Why use batches?

Is there any advantage or reason to use batches instead of one single batch except for being less computationally expensive?

It would depend. Are you doing training or inference?

If you are training, batches can shorten training time because you process multiple images at once and update the network using the loss computed over all of them. Each update is then based on several examples rather than a single one, which tends to give a less noisy gradient estimate. Here is a nice article that may help you understand why batch sizes larger than one can be beneficial when training NNs. TL;DR: batch normalization, or batchnorm for short, is proposed as a technique to help coordinate the update of multiple layers in the model, and it relies on statistics computed across the examples in a batch.

If you are using your model after training, it is very common to use a batch size of one, since you may be limited by the rate at which your inputs arrive. For example, in some vision tasks a model may take its input from a camera that outputs 30 FPS, and depending on the task, latency might be an important factor. So instead of, let’s say, processing a batch of 30 frames every second, the model takes the latest image from the camera and outputs a prediction based on it, which takes much less time.
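To put a rough number on the camera example above, here is a small sketch of how long the oldest frame has to wait before a batch is even full. The function name `batching_wait_seconds` is made up for illustration, and it assumes frames arrive at a perfectly steady rate and ignores the model's own compute time.

```python
def batching_wait_seconds(batch_size: int, fps: float) -> float:
    """Time the first frame of a batch waits for the remaining frames to arrive.

    Assumes a steady frame rate; model compute time is not included.
    """
    if batch_size < 1 or fps <= 0:
        raise ValueError("batch_size must be >= 1 and fps must be > 0")
    # The first frame must wait for (batch_size - 1) further frames.
    return (batch_size - 1) / fps


if __name__ == "__main__":
    for bs in (1, 8, 30):
        wait = batching_wait_seconds(bs, fps=30.0)
        print(f"batch size {bs:>2}: oldest frame waits {wait:.3f} s")
```

With a batch size of one the wait is zero, while filling a batch of 30 from a 30 FPS camera delays the oldest frame by almost a full second before inference can start.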

Another advantage is that (as you already said), once your model is trained you can achieve similar performance with much less powerful hardware, since the computational cost of processing one input at a time is much lower.

I hope this helps!

Using batches during training comes about as a tradeoff between two competing objectives.
Firstly, in terms of the learning algorithm, we would ideally want to backpropagate on every training example. This would allow the optimizer to slightly tweak the NN to best fit every training example. However, this methodology has the obvious drawback of being computationally inefficient.

On the other hand, we can compute the loss on the entire dataset and backpropagate only once every epoch. This approach is more computationally efficient, since it takes full advantage of parallelization. However, with a single batch we can only update the network once per pass through the entire dataset, and this can limit the network’s ability to learn (especially the edge cases).

Thus, using batches provides the best compromise between computational efficiency and the number of updates per epoch that the network needs to best approximate the training data. I would further encourage you to read up on batch gradient descent vs. mini-batch gradient descent.

Also, by having multiple batches, the optimizer can update the NN multiple times every epoch. This can reduce the overall training time by reducing the total number of epochs required to reach convergence.
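The trade-off between per-example, mini-batch and full-batch updates can be sketched with a toy example. This is only an illustration in plain Python, not how you would train a real network: the helper names (`make_data`, `train`), the 1-D linear regression task, the learning rate and the dataset size are all assumptions chosen to keep it short.

```python
import random


def make_data(n=200, seed=0):
    """Noise-free toy data for y = 3x + 0.5 (an assumed task for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 for x in xs]
    return xs, ys


def train(xs, ys, batch_size, epochs=5, lr=0.1, seed=0):
    """Mini-batch gradient descent on mean squared error.

    Returns (number of parameter updates, final loss on the full dataset).
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    updates = 0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # new random batch assignment every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the squared error, averaged over the current batch.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in idx) / len(idx)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in idx) / len(idx)
            w -= lr * gw
            b -= lr * gb
            updates += 1
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return updates, loss


if __name__ == "__main__":
    xs, ys = make_data()
    for bs in (len(xs), 20, 1):
        updates, loss = train(xs, ys, batch_size=bs)
        print(f"batch size {bs:>3}: {updates:>4} updates, final loss {loss:.6f}")
```

For the same number of epochs, the full-batch run makes one update per epoch while the smaller batch sizes make many more, which is why they get closer to the target in this toy setting. The per-example case pays for that in a real network with far more sequential, poorly parallelized steps.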

Another thing to consider would be the added memory overhead that is associated with large batch sizes. This can be a massive hindrance, particularly while training on GPUs with limited VRAM.
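As a back-of-the-envelope illustration of that memory overhead, the sketch below estimates the size of the input tensor alone for a given batch size. The function name `input_batch_bytes` and the 3x224x224 float32 image shape are assumptions for the example. Real training needs considerably more than this, because the activations kept for backpropagation also grow with the batch size, while weights and optimizer state do not.

```python
def input_batch_bytes(batch_size, channels, height, width, bytes_per_element=4):
    """Bytes needed to hold one input batch (default: 4 bytes per float32 value)."""
    return batch_size * channels * height * width * bytes_per_element


if __name__ == "__main__":
    for bs in (1, 32, 256):
        size = input_batch_bytes(bs, channels=3, height=224, width=224)
        print(f"batch size {bs:>3}: {size / 2**20:.1f} MiB for the inputs alone")
```

The input size grows linearly with the batch size, so doubling the batch roughly doubles this part of the VRAM budget.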

Hope this helps.


You could always try training on 1 batch if you have a CS-1 wafer-scale accelerator https://arxiv.org/pdf/2010.03660.pdf :eyes: