Hi all,
my net (a multimodal transformer for text and vision) vastly overfits after 4 epochs. I tried all kinds of combinations of different dropout strengths, weight decay values, learning rate schedules, and so on…
the only thing that changes is the number of epochs until the train and eval scores converge to the same values. Example:
without any regularization, best scores: epoch 4: train ROC AUC 99%, val ROC AUC 64%
with 40% dropout and weight decay, best validation scores across all epochs: epoch 30: train ROC AUC 99%, val ROC AUC 64%
This is just one example. I tried many different regularization strengths and I always end up with such bad scores on the validation data. The only thing that varies is the number of epochs it takes to get there.
The baseline method achieves validation scores of 80% ROC AUC… so there is definitely room for improvement…
What am I doing wrong?
Do you have any suggestions?
I have only about 8,000 training observations, but the baseline model used the same data.
I use ViT (Vision Transformer) to extract a semantic representation. It comes with its own feature extractor that normalizes and rescales the images.
yes:
bi_encoder.eval()
for batch in loader:
    with torch.no_grad():
        proba = bi_encoder(batch)
…
There is no magic in neural networks: ultimately a network works in a very high-dimensional simplex plane for classification or regression (it is a linear function fit and region cut). If you are so unlucky that your validation dataset happens to be linearly “perpendicular” to your training dataset in that high-dimensional space, then no matter how you train your model, the validation result will be poor by definition. To test the point, you might swap your validation set and training set and see whether the phenomenon persists… (the above thought is just a conjecture, not a theorem, for your case, because I have very little information about your model… ; ))
The baseline model is trained on the same training set and evaluated on the same validation data (not by me but by Facebook AI), so unfortunately we can discard that theory.
But I never thought of training it on the validation data and testing it on the training data - what kind of experimental results do you expect, and what exactly would we derive from that?
It’s a binary classification problem and the validation set is perfectly balanced (250 of class 0, 250 of class 1).
The training set is a bit skewed: about 35% of it is class 1. I try to combat that using the WeightedRandomSampler to achieve implicit oversampling. Beforehand I tried undersampling by just randomly discarding the surplus. Both lead to more or less (± 1%) the same results.
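For reference, the implicit-oversampling setup via WeightedRandomSampler can be sketched like this (the labels and dataset are toy placeholders with the 35%/65% skew from the thread, not the actual data):

```python
# Sketch of implicit oversampling with WeightedRandomSampler.
# `labels` and the dataset are toy placeholders (35%/65% skew as in the post).
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0] * 65 + [1] * 35)       # ~65% class 0, ~35% class 1
class_counts = torch.bincount(labels)            # tensor([65, 35])
class_weights = 1.0 / class_counts.float()       # rarer class gets a higher weight
sample_weights = class_weights[labels]           # one weight per sample

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)

train_ds = TensorDataset(torch.randn(len(labels), 4), labels)
loader = DataLoader(train_ds, batch_size=10, sampler=sampler)
# batches drawn from `loader` are now roughly class-balanced in expectation
```

Because sampling is with replacement, minority-class examples are simply drawn more often; the dataset itself is untouched.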
You have a low sample size. I would really up the augmentation and set the dropout to 50%. Apply as many transforms as you can before putting the images into the model.
I.e. ColorJitter, RandomAffine, RandomResizedCrop, RandomHorizontalFlip, RandomPerspective, RandomRotation, GaussianBlur, RandomInvert, RandomPosterize, RandomSolarize, etc. https://pytorch.org/vision/stable/transforms.html
You can’t go wrong with too many transforms, especially with little data. Nvidia’s research team has found that aggressive transforms produce excellent results for ML vision.
One other question, how many parameters are you using for the entire model? For example: print(sum(p.numel() for p in model.parameters()))
The number of parameters is really high. In total the model consists of 2 BERT models plus a linear layer of size 768x2x2.
So that’s still a huge capacity.
I didn’t want to decrease the model’s depth, because I wanted to take advantage of the pretrained parameters (I load both models from the Hugging Face model hub).
I tried to rein in the capacity using weight decay and dropout.
Right. I think for this smaller dataset, you’re not really justified in having such a large model. Larger models tend to overfit faster, unless you really up the dropout and augmentation to counter that tendency.
According to your conversation, my limited understanding is: your training dataset skew is 35% 1 / 65% 0, your validation dataset skew is 50% 1 / 50% 0, and the “best performance” of the baseline model is 80%. (Was that on the same validation dataset or a different one? What is the validation dataset skew for the baseline model - also 50%/50%, or different?)
Then your model may actually already have achieved almost the “best performance” possible given the training dataset skew: a train ROC of 99% means your model will predict pretty much in line with your training dataset skew, which is 35% 1 and 65% 0 (this is what training means - you tell the model what the universe looks like: a 35%/65% skew!). Then, when you validate on the 50%/50% dataset against the 80% best performance, your model’s best-case prediction statistics are roughly (35% 1 and 65% 0) * 80% => (15% false negatives) * 80% => 85% * 80% val ROC => 68% val ROC.
You reported that you achieved 64% val roc which is already very close to 68%.
If you swap the val and train datasets (i.e. use the 50%/50% skew to train the model and test it on the 35%/65% skew), then the resulting val ROC might approach a similar value of around 68%.
The extreme case would be a training skew of 1%/99% or even 0%/100%; then I think you can guess that the val ROC on 50%/50% data would be very poor…
Again, this is a super simplified analysis that just tries to reveal the “linearity” limitation of current NN technology: NNs are good at interpolation and not good at extrapolation… ; (
So, under the assumption that your network and training have no bugs, I would suggest tuning your training set to roughly a 50%/50% skew, then testing again to see if your val ROC improves…
(Disclaimer: again, the above thoughts are not a theorem but a conjecture that deserves further validation ; ))
Yes, the training data has a 35%/65% imbalance; the validation data is perfectly balanced.
Yes, the baseline model is (according to the authors) evaluated on the exact same validation data.
But effectively I already have a less skewed dataset due to oversampling (and I tried undersampling before to achieve a perfectly balanced training set), yet I get more or less the same results (maybe a 1% difference between the two cases).
I tried Adam and AdamW with learning rates varying across the range 10**-2 to 10**-6, and with varying weight decays.
It turns out that in my case an initial learning rate of 10**-4 seems to be the best choice.
Moreover, I use a learning rate scheduler that lowers the learning rate via lr_i = 0.9 * lr_{i-1}, but it doesn’t seem to have an impact on model performance or overfitting.
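That decay rule lr_i = 0.9 * lr_{i-1} corresponds to PyTorch’s ExponentialLR with gamma=0.9; a minimal sketch (the model and optimizer here are dummy placeholders, not the actual setup):

```python
# The decay rule lr_i = 0.9 * lr_{i-1} is ExponentialLR with gamma=0.9.
# The model and optimizer below are dummy placeholders.
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(3):
    # ... one epoch of training would go here ...
    optimizer.step()      # optimizer steps first,
    scheduler.step()      # then the scheduler multiplies the lr by 0.9
```

After 3 scheduler steps the learning rate is 1e-4 * 0.9**3.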
Can you please try it with SGD and see if you get the same overfitting? Also, can you remove the scheduler and the learning rate update you have been using?
1 - I would lose all learning rate scheduling and see if there is a difference between the two optimizers with a few different parameters.
2 - Once you know which one is best in this case, add the LR schedule back.
Also - what are the results when all classes are the same size? It’s a classification problem, right? Try it out when all classes have the same number of samples, for 4 epochs or so.
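One quick way to build such an equally-sized training subset is to randomly undersample the majority class (toy labels with the thread’s 35%/65% skew; index names are placeholders):

```python
# Sketch: randomly undersample the majority class so both classes end up
# the same size. Toy labels with the thread's 35%/65% skew.
import torch

labels = torch.tensor([0] * 65 + [1] * 35)
idx0 = torch.nonzero(labels == 0).squeeze(1)     # indices of class 0
idx1 = torch.nonzero(labels == 1).squeeze(1)     # indices of class 1
n = min(len(idx0), len(idx1))                    # minority-class size (35)

balanced_idx = torch.cat([
    idx0[torch.randperm(len(idx0))[:n]],         # random subset of the majority
    idx1[torch.randperm(len(idx1))[:n]],         # all of the minority (n == 35)
])
# index your dataset with balanced_idx, e.g. via torch.utils.data.Subset
```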