Why does the use of mini-batches have such an effect on my classifier?

I am learning PyTorch and coded a minimal classifier to play with:

import torch
import numpy as np
import matplotlib.pyplot as plt

numclasses, count = 8, 200
x = torch.randn(count, 4)
y = torch.randint(0, numclasses, size=[count])

dataset = torch.utils.data.TensorDataset(x, y)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=20, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 20), 
    torch.nn.ReLU(), 
    torch.nn.Linear(20, numclasses),
    torch.nn.Softmax(dim=1))

lossf = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

losses = []
for epoch in range(30):
    for batch in dataloader:
        x, y = batch
        out = model(x) 
        loss = lossf(out, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

plt.plot(losses)

This is what I get with the full dataset (batch_size = 200):

[plot: training loss per update, batch_size = 200 (full batch)]

And this is with mini-batches (batch_size = 20):

[plot: training loss per update, batch_size = 20 (mini-batches)]

Any idea why this happens?

Could this ever happen in real problems with real data?

The larger the batch size, the less noise the parameter updates will contain.
Often this noise is beneficial for reaching a better final accuracy, but it can depend on your use case.
You can find articles that compare gradient descent, batch gradient descent, and stochastic gradient descent.

Chapter 5.2.4 in Pattern Recognition and Machine Learning might give you more information.
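
To see the noise argument numerically, here is a minimal sketch (my own addition, not from your post) that compares per-batch gradients against the full-batch gradient for the two batch sizes you tried; smaller batches deviate more from the full-batch direction, i.e. the updates are noisier:

import torch

torch.manual_seed(0)
numclasses, count = 8, 200
x = torch.randn(count, 4)
y = torch.randint(0, numclasses, size=[count])

model = torch.nn.Linear(4, numclasses)
lossf = torch.nn.CrossEntropyLoss()

def grad_vector(xb, yb):
    # Gradient of the loss on one batch, flattened into a single vector.
    model.zero_grad()
    lossf(model(xb), yb).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

full_grad = grad_vector(x, y)

for batch_size in (200, 20):
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=batch_size, shuffle=True)
    # Mean squared distance between each batch gradient and the full-batch gradient.
    deviation = torch.stack([(grad_vector(xb, yb) - full_grad).pow(2).sum()
                             for xb, yb in loader]).mean()
    print(f"batch_size={batch_size:3d}  mean squared gradient deviation: {deviation.item():.6f}")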

So the full-batch version is basically overfitting, while the mini-batch version is unable to do that (since our data is just noise and there are no real patterns to learn)?

Can this insight be of any use when working on real but very noisy data?

No, I don’t think you are seeing any overfitting here.
A model is overfitting if the training loss is decreasing while the validation loss stays flat or increases.
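
If you want to check for that, here is a small sketch (my own addition, reusing your setup) that tracks a validation loss next to the training loss. Note that I also dropped the final Softmax, since nn.CrossEntropyLoss expects raw logits and applies log-softmax internally:

import torch

torch.manual_seed(0)
numclasses, count = 8, 200
x = torch.randn(count, 4)
y = torch.randint(0, numclasses, size=[count])

# Hold out the last 25% of the samples as a validation set.
split = int(0.75 * count)
train_ds = torch.utils.data.TensorDataset(x[:split], y[:split])
val_x, val_y = x[split:], y[split:]
dataloader = torch.utils.data.DataLoader(train_ds, batch_size=20, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, numclasses))
lossf = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(30):
    model.train()
    for xb, yb in dataloader:
        loss = lossf(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = lossf(model(val_x), val_y)
    # Training loss falling while validation loss stays flat or rises would indicate overfitting.
    print(f"epoch {epoch:2d}  train loss {loss.item():.4f}  val loss {val_loss.item():.4f}")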

Yes, since the batch size can also be treated as a hyperparameter. You could also, for example, increase the batch size later in training to smooth out the gradient.
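
A rough sketch of that idea (my own, not a standard recipe), reusing dataset, model, lossf, and optimizer from your code: train with small batches first, then switch to a DataLoader with a larger batch size for the later epochs.

small_loader = torch.utils.data.DataLoader(dataset, batch_size=20, shuffle=True)
large_loader = torch.utils.data.DataLoader(dataset, batch_size=100, shuffle=True)

for epoch in range(30):
    # Switch to larger (less noisy) batches for the last third of training.
    loader = small_loader if epoch < 20 else large_loader
    for xb, yb in loader:
        loss = lossf(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()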
