Here’s my scenario. I have an abundance of data, tens of millions of diverse samples, with proper dataset hygiene maintained: train/test/val partitioning, as good a statistical distribution as the data allows, and a custom data generator that maintains that distribution throughout the process.
The network itself is, I presume, rather small: a couple million parameters, performing convolutional multiclass classification. With batch size 32, training occupies 97% of the GPU’s 16 GB of memory and takes 10 hours per 10% of the dataset.
The problem I encounter happens early in training. At around 5-8% of the dataset the network begins overfitting, which I diagnose with regular snapshots: the typical stall of test accuracy while test loss ramps up. The problem resists the usual remedies (batch normalization, various dropout strategies), and hyperparameter exploration fails as well.
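Concretely, the snapshot check amounts to something like this toy helper (the rule, thresholds, and function name are just an illustrative heuristic, not my actual tooling):

```python
def overfit_onset(test_loss, test_acc, patience=3, tol=1e-3):
    """Return the index of the first snapshot where test loss rose for
    `patience` consecutive snapshots while test accuracy failed to
    improve by more than `tol`; return None if that never happens."""
    rises = 0
    for i in range(1, len(test_loss)):
        if test_loss[i] > test_loss[i - 1] and test_acc[i] <= test_acc[i - 1] + tol:
            rises += 1
            if rises >= patience:
                return i - patience + 1  # first snapshot of the bad streak
        else:
            rises = 0  # any recovery resets the streak
    return None

# Toy snapshot history: loss bottoms out at index 2, then climbs
# while accuracy plateaus around 0.29.
loss = [2.0, 1.5, 1.2, 1.25, 1.3, 1.4, 1.6]
acc  = [0.10, 0.20, 0.28, 0.29, 0.29, 0.29, 0.28]
print(overfit_onset(loss, acc))  # → 4
```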
From the starting point until now, architecture changes have taken the network from 15% to 30% top-1 accuracy on test data before overfitting occurs, with the onset of overfitting moving from 5% to 8% of the dataset.
Since training is rather time-costly and already matches my card’s capabilities, whichever approach I decide on will take around two weeks to fully explore, so I’m looking for guidance on which route might be better to take first, from people with experience on non-trivial projects.
Here are the two divergent approaches I’m considering:
- Decrease the batch size to 24/16 and explore the wider architectures that the 16 GB GPU allows.
- Increase the computational complexity/depth of the network.

Any insight or suggestion will be welcome.
You could try the various regularization techniques described here.
Thanks for the input. So far I’ve run four different regularization schemes: L1, L2, and L1L2, applied to different layers, to just the output layers, and to the whole model, with default textbook parameters and some wild guesses. No change in outcome; regularization has no impact besides wild loss output at the beginning of training.
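For concreteness, the penalties I mean are the textbook ones added to the loss; a minimal NumPy sketch with toy weights and purely illustrative coefficients:

```python
import numpy as np

def l1l2_penalty(weights, l1=0.0, l2=0.0):
    """Textbook elastic-net style penalty added to the training loss:
    l1 * sum(|w|) + l2 * sum(w^2). Setting l1 or l2 to 0 recovers
    plain L2 or L1 regularization respectively."""
    w = np.asarray(weights, dtype=float)
    return l1 * np.abs(w).sum() + l2 * np.square(w).sum()

# Toy weight matrix standing in for one layer's kernel.
w = np.array([[1.0, -2.0], [0.5, 0.0]])
print(l1l2_penalty(w, l1=0.01, l2=0.1))  # 0.01*3.5 + 0.1*5.25 = 0.56
```

In a framework this term is summed over whichever layers the regularizer is attached to, which is exactly the knob I was varying (output layers only vs. the whole model).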
I think I’ll give up on that and get back to making the net wider xor deeper.
Unless you have any other ideas?
Making the network bigger will likely only make the overfitting worse. One way to think about overfitting is that your network has so many parameters that it “memorizes” the training data. That is, the network has so much “memory”, in the form of parameters it can adjust while learning, that it remembers that input x has output y for lots of (input, output) pairs (x, y) in the training set. So it does very well on the training set. But since it has not memorized anything about the test set, it fails miserably on the test set.
So, making the network larger will only give it more room to remember the training set, and make the overfitting worse. Every regularization technique you tried can be thought of as a way of limiting how much the network can memorize.
The better idea may be to make the network smaller in some manner.
After some work, I’ve managed to get the model past 35% top-1 accuracy on test and validation, with overfitting now showing up past 18% of the dataset. I didn’t want to discount “making the network smaller” before I found the next stepping stone, as that is the generic intuition for what works in neural networks.
What worked for me was an autoencoder, trained on the same training data and put at the beginning of the final network with its weights frozen. So I guess I took a chapter from transfer learning.
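Stripped down to a toy linear autoencoder in NumPy (the dimensions, learning rate, and random data are arbitrary stand-ins; the real network is convolutional), the pattern looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "autoencoder": encode 4-dim inputs to 2 dims, decode back.
X = rng.normal(size=(100, 4))
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))

# Train the autoencoder by gradient descent on reconstruction error.
for _ in range(200):
    Z = X @ W_enc                  # latent codes
    X_hat = Z @ W_dec              # reconstruction
    err = X_hat - X
    W_dec -= 0.01 * (Z.T @ err) / len(X)
    W_enc -= 0.01 * (X.T @ (err @ W_dec.T)) / len(X)

# "Freeze" the encoder: from here on only the classifier head trains,
# consuming the fixed features Z = X @ W_enc as its input.
W_enc.setflags(write=False)
features = X @ W_enc
print(features.shape)  # (100, 2)
```

After freezing, gradients from the classification loss are simply never applied to the encoder weights, which is the part that stopped the early memorization for me.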
Coming back to the original question: I guess I’m now looking for other building blocks that people have employed in their successes breaking through barriers in their nets.
Thank you for sharing what you found. I haven’t used autoencoders so far, but what you learned definitely goes into my “anti-overfitting toolkit”!