I am training a CNN on matrices of shape M x N. The input channel size is 1, since each matrix sample should be mapped to a single output. So in principle, it is just like feeding a grayscale image to the CNN.

I am feeding data in the format [batch_size, 1, M, N], where 1 is the channel size, which is then “expanded” into more channels by the various Conv2d layers. The dataloader has shuffle=False, in case that matters.

Ultimately, in order to make the network “regressive”, I apply an nn.Flatten() layer, followed by two linear output layers. The nn.Flatten() transforms the data to the format [batch_size, input_size_linear], where input_size_linear is the size of the flattened layer resulting from the various conv and max-pool operations.
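For reference, here is a minimal sketch of such an architecture; the channel counts, kernel sizes, and M = N = 28 are hypothetical placeholders, not the actual setup:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; M, N and the channel counts are placeholders.
M, N = 28, 28

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # [B, 1, M, N] -> [B, 16, M, N]
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> [B, 16, M/2, N/2]
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> [B, 32, M/2, N/2]
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> [B, 32, M/4, N/4]
    nn.Flatten(),                                 # -> [B, 32 * (M//4) * (N//4)]
    nn.Linear(32 * (M // 4) * (N // 4), 64),
    nn.Linear(64, 1),                             # single regression output
)

x = torch.randn(100, 1, M, N)  # a batch of 100 "grayscale images"
out = model(x)
print(out.shape)  # torch.Size([100, 1])
```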

So far so good. I have a simple question: I do not understand what difference the batch_size makes here. Setting batch_size = 1 would load the pictures one by one, updating the weights after each image, while batch_size = 100 would train on 100 pictures and then compute the gradient of the total loss over those 100 images? Or are there any additional effects from changing the batch_size?

I think you are basically asking, and correct me if I’m wrong, what the purpose of a mini-batch is.

The difference is this: with batch_size = 1, each image from the training set is fed to the network one by one, and after each image you calculate the gradients according to the loss on that particular image and update the network.

With batch_size = 100, you feed all 100 images at once (which is done simultaneously and saves a lot of time) and update according to the averaged loss over all 100 images.

That way the gradient is averaged over 100 images and the direction of the step is more accurate.

In general, intuitively you would like to take a gradient step towards the global minimum of the loss function based on all the information you have in the training phase (that’s called gradient descent), i.e. taking a step only after feeding all the images to the network and averaging the loss, or in other words using batch_size = length(train_set). But as you can imagine, this consumes a lot of time, because you have to feed the whole training set in order to make a single step. Another reason to use mini-batches is that it helps with avoiding local minima in the loss function.

In practice, gradient descent works well with mini-batches, and even with batch_size = 1 (this case is called stochastic gradient descent).
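One way to see the averaging concretely: with a mean-reduced loss, the gradient computed from one batch of 100 equals the average of the 100 per-sample gradients (for a single step, before any update). A small sketch with a toy linear model and made-up data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)          # toy model, stand-in for the CNN
x = torch.randn(100, 4)          # made-up batch of 100 samples
y = torch.randn(100, 1)
loss_fn = nn.MSELoss()           # reduction="mean" by default

# One step with batch_size = 100: gradient of the averaged loss.
model.zero_grad()
loss_fn(model(x), y).backward()
batch_grad = model.weight.grad.clone()

# The 100 per-sample gradients (as with batch_size = 1, but without
# updating the weights in between), then averaged.
per_sample = []
for i in range(100):
    model.zero_grad()
    loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
    per_sample.append(model.weight.grad.clone())
avg_grad = torch.stack(per_sample).mean(dim=0)

print(torch.allclose(batch_grad, avg_grad, atol=1e-5))  # True
```

Note that actual batch_size = 1 training updates the weights after every sample, so the later gradients are taken at different weights and the two training trajectories diverge; the equality above holds only for a single step.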

Hello Yuv,
thanks for your detailed response. You confirm my thought that the reason for using batches is mostly optimization-related. One final question: when using, for instance, a BatchNorm2d layer in the fully connected output layer, this means that the data will be (for instance) demeaned and standardized by computing mean and std along the batch_size dimension, correct? As I understand it, the BatchNorm2d layer learns the mean and std, in order to be able to normalize data out-of-sample. So as soon as we introduce such a layer, this introduces a stronger dependence of the model on the batch_size, right? Thanks! Best, JZ

Hey,
I’m glad that I could help.
You can read what BatchNorm2d does explicitly in the docs or in the original paper that explains how batch normalization works.

As you can see in the docs:

The mean and standard-deviation are calculated per-dimension over the mini-batches.
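To make that concrete, here is a small check (with made-up tensor sizes) that BatchNorm2d in training mode normalizes each channel using the mean and variance taken over the batch and spatial dimensions together:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 5, 5)  # [batch, channels, H, W], sizes are arbitrary

bn = nn.BatchNorm2d(3, affine=False)  # no learnable scale/shift, to see pure normalization
bn.train()
out = bn(x)

# Manual normalization: mean/var per channel, over batch and spatial dims.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-5))  # True
```

In eval mode the layer instead uses the running statistics accumulated during training, which is how it normalizes out-of-sample data.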

I’m not sure about your statement that batch normalization introduces a stronger dependence on the batch_size. From here on it is my opinion only, and you probably should check this yourself:
As far as I understand (although I don’t know which dataset you use), the pixels of the images (I assume you use images) in the whole dataset should follow some normal distribution with a certain mean and variance. If that is correct, one can expect that randomly taking batches out of the dataset would also result in a normal distribution with approximately the same mean and variance (Central Limit Theorem). The approximation depends on the batch_size and is exact when batch_size equals the dataset size, but in practice it works fine even with smaller batches.
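This effect is easy to simulate. The sketch below (with a made-up normally distributed “dataset”) shows that batch means cluster around the dataset mean, and that their spread shrinks as the batch size grows, roughly like 1/sqrt(batch_size):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical "pixel values": normal with mean 2.0 and std 3.0.
dataset = rng.normal(loc=2.0, scale=3.0, size=100_000)

spreads = {}
for batch_size in (10, 100, 1000):
    # Draw 500 random batches and record the mean of each.
    batch_means = np.array(
        [rng.choice(dataset, size=batch_size).mean() for _ in range(500)]
    )
    spreads[batch_size] = batch_means.std()
    print(batch_size, round(batch_means.mean(), 2), round(spreads[batch_size], 3))

# Larger batches estimate the dataset mean with less variance.
print(spreads[10] > spreads[100] > spreads[1000])  # True
```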

thanks again. I follow your argument for the CLT on a normally distributed dataset, which makes perfect sense. I figured my initial description of what I am doing is probably a bit misleading and incomplete. I wrote something like “So in principle, it is just like feeding a grayscale image to the CNN” to simplify the description of the example, but this is actually not true, so please forget about that ^^

Instead, what I have is highly non-stationary, correlated time series data. I am using a sliding window approach to create fictitious “images”. Say I have M time series. I use a sliding window of length N (“lookback”). From this, I create images of size M x N. Each row (corresponding to an individual time series) is min-max-scaled. I slide the window in steps of size S, in order to create batches of overlapping windows. My labels are some “future property” of the time series, following the sliding window.
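For what it’s worth, the window construction described above can be sketched like this; the series, M, N, and S here are made-up placeholders, and the “time series” are just toy random walks:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 4, 200   # hypothetical: 4 time series of length 200
N, S = 30, 5    # lookback window length N, stride S (placeholders)
series = rng.standard_normal((M, T)).cumsum(axis=1)  # toy random-walk series

windows = []
for start in range(0, T - N + 1, S):
    w = series[:, start:start + N]       # one "image" of shape [M, N]
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    w = (w - lo) / (hi - lo + 1e-8)      # min-max scale each row separately
    windows.append(w)

# Stack into CNN input format [num_windows, 1, M, N].
batch = np.stack(windows)[:, None, :, :]
print(batch.shape)  # (35, 1, 4, 30)
```

Consecutive windows overlap by N - S samples, which is what makes the samples in a batch strongly correlated.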

My intention with this model is to detect patterns and interdependencies among the M correlated time series, while simultaneously detecting patterns over time. Obviously, the non-stationarity of the time series already implies that it is probably a bad idea to apply batch normalization here, especially with overlapping windows, right? Although, on the other hand, the min-max scaling at least confines the mean and variance of the individual samples a bit.

I am drifting a bit off topic here, but still, if you have some thoughts, I’d be interested to hear them, because I am relatively new to applied ML and need some practitioner’s advice :)) Thanks!

If I understand correctly, the problem you describe seems to be very complicated, and you probably should do some literature review and research into what people have done in this field. Some tags I would search for are video object detection/recognition/tracking, recurrent CNNs, etc.