Still overfitting, no matter how strongly I regularize

Hi all,
My net (a multimodal transformer for text and vision) heavily overfits the training set after 4 epochs. I tried all kinds of combinations of dropout strengths, weight decays, learning rate schedules, and so on…

The only thing that changes is the number of epochs until the train and eval scores end up at the same values. Example:

Without any regularization, best scores: epoch 4: train ROC AUC 99%, val ROC AUC 64%

With 40% dropout and weight decay, best validation scores across all epochs: epoch 30: train ROC AUC 99%, val ROC AUC 64%

This is just one example. I tried many different regularization strengths and I always end up with equally bad scores on the validation data. The only thing that varies is the number of epochs it takes to get there.

The baseline method achieves a validation ROC AUC of 80%, so there is definitely room for improvement…

What am I doing wrong?
Do you have any suggestions?

Kind Regards,
Milan
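
For readers less familiar with the metric quoted in the post, here is a minimal sketch of ROC AUC computed pairwise in plain Python. The function name and the toy labels and scores are made up for illustration; in practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(labels, scores):
    """Chance that a randomly chosen positive is scored above a randomly chosen negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 of the 4 positive/negative pairs are ranked correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.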

  1. How many samples are in your dataset?
  2. Have you tried using torchvision transforms to augment the image data during training (switched off during eval)?
  3. During eval, are you using:
with torch.no_grad():
    model.eval()
    ... # eval steps
model.train()

Sounds like an issue of too little data and no transforms.
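
The eval pattern asked about in point 3 could be wrapped in a small helper along these lines. This is only a sketch: `predict_proba`, the toy model and the toy loader are hypothetical, and the sigmoid assumes the model returns a single logit per sample, which the thread does not state.

```python
import torch
from torch import nn


def predict_proba(model, loader):
    """Run inference with dropout disabled and without tracking gradients."""
    was_training = model.training
    model.eval()                       # switch dropout / batch norm to inference behaviour
    outputs = []
    with torch.no_grad():              # no autograd graph is built inside this block
        for batch in loader:
            # Assumption: the model returns one logit per sample.
            outputs.append(torch.sigmoid(model(batch)))
    model.train(was_training)          # restore whatever mode the model was in
    return torch.cat(outputs)


# Toy stand-ins (hypothetical) so the sketch runs on its own.
toy_model = nn.Sequential(nn.Dropout(0.5), nn.Linear(4, 1))
toy_loader = [torch.randn(8, 4) for _ in range(3)]
probs = predict_proba(toy_model, toy_loader)
```

Because dropout is inactive in eval mode, calling the helper twice on the same batches gives identical outputs, which is a quick sanity check that eval mode is really on.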

  1. Only about 8000 training observations, but the baseline model used the same data.
  2. I use ViT (Vision Transformer) to extract a semantic representation. It comes with its own feature extractor that normalizes and rescales the images.
  3. yes:
    bi_encoder.eval()
    for batch in loader:
        with torch.no_grad():
            proba = bi_encoder(batch)

How do you generate the validation dataset?

Does it come from the 8000 training samples?

No, I split them up before the training process and stored both the validation and the training set on my hard drive. They are 100% disjoint.
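
One way to double-check that two stored splits really are disjoint is to compare content hashes. The sketch below works on raw bytes; the function names and the toy byte strings are hypothetical, and it only catches byte-identical duplicates, not near-duplicates such as resized or re-encoded copies.

```python
import hashlib


def digest(raw: bytes) -> str:
    """SHA-256 hex digest of one sample's raw bytes."""
    return hashlib.sha256(raw).hexdigest()


def find_leaks(train_blobs, val_blobs):
    """Indices of validation items whose exact bytes also occur in the training set."""
    seen = {digest(b) for b in train_blobs}
    return [i for i, b in enumerate(val_blobs) if digest(b) in seen]


# Toy data (hypothetical): the second validation item duplicates a training item.
train_blobs = [b"image-001", b"image-002", b"image-003"]
val_blobs = [b"image-004", b"image-002"]
leaks = find_leaks(train_blobs, val_blobs)
```

For files on disk, the byte strings could come from `pathlib.Path(p).read_bytes()`; an empty result means no exact duplicates were found.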

There is no magic in neural networks, except that they work in a very high-dimensional simplex plane for classification or regression (ultimately it’s a linear function fit and region cut). If you are unlucky enough that your validation dataset happens to be linearly “perpendicular” to your training dataset in that high-dimensional space, then no matter how you train your model, the validation result will be poor by definition. To test this, you might swap your validation set and training set and see what happens… (The above is just a conjecture, not a theorem for your case, because I have very little information about your model… ; ))

What percent of each category is in your validation set?
I.e. 10% cats, 10% dogs, etc.
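
A quick way to answer that kind of question is to count the labels of a split. This is a minimal sketch; `class_shares` and the toy label list are made up for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of samples per class, keyed by class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


# Toy labels (hypothetical): 6 samples of class 0 and 4 of class 1.
shares = class_shares([0] * 6 + [1] * 4)
```

Running the same function on the training labels and on the validation labels makes any difference in class balance between the two splits visible at a glance.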

The baseline model is trained on the same training set and evaluated on the same validation data (not by me but by Facebook AI), so unfortunately we can rule out that theory.

But I never thought of training on the validation data and testing on the training data. What kind of experimental results would you expect, and what exactly would we derive from that?

thank you for the input =)

it’s a binary classification problem and the validation set is perfectly balanced (250 of class 0, 250 of class 1)

The training set is a bit skewed: about 35% are of class 1 there. I try to combat that using the WeightedRandomSampler to achieve implicit oversampling. Before that, I tried undersampling by randomly discarding the surplus. Both give more or less (± 1%) the same results.
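
For reference, implicit oversampling with `WeightedRandomSampler` is commonly set up with per-sample weights inversely proportional to class frequency. The sketch below is not the poster's actual code; the label tensor is a hypothetical stand-in that mimics the 35%/65% split.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical labels mimicking the 35% / 65% split described above.
labels = torch.tensor([1] * 350 + [0] * 650)

class_counts = torch.bincount(labels)                 # samples per class
sample_weights = 1.0 / class_counts[labels].float()   # rarer class gets the larger weight

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),     # one "epoch" draws as many samples as the dataset holds
    replacement=True,            # required so minority samples can be drawn repeatedly
    generator=torch.Generator().manual_seed(0),
)

# Draw one epoch of indices and check the resulting class balance.
drawn = labels[list(sampler)]
share_of_class_1 = drawn.float().mean().item()
```

The sampler would then be passed to the loader as `DataLoader(dataset, batch_size=..., sampler=sampler)`, without `shuffle=True`, since the two options are mutually exclusive.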

You have a low sample size. I would really up the augmentation and set the dropout to 50%. Use as many transforms as you can before putting the images into the model.
E.g. ColorJitter, RandomAffine, RandomResizedCrop, RandomHorizontalFlip, RandomPerspective, RandomRotation, GaussianBlur, RandomInvert, RandomPosterize, RandomSolarize, etc.
https://pytorch.org/vision/stable/transforms.html

You can’t go wrong with too many transforms, especially with low data. Nvidia’s research team has found aggressive transforms produce excellent results for ML vision.

One other question, how many parameters are you using for the entire model? For example:
print(sum(p.numel() for p in model.parameters()))
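
A slightly extended version of that one-liner, as a sketch, separates total from trainable parameters, which matters if some pretrained layers are frozen. The helper name and the toy model are hypothetical.

```python
from torch import nn


def count_parameters(model):
    """Return (total, trainable) parameter counts for a module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# Toy model (hypothetical): Linear(10, 5) has 55 parameters, Linear(5, 2) has 12.
toy = nn.Sequential(nn.Linear(10, 5), nn.Linear(5, 2))

# Freeze the first layer to show how the two counts diverge.
for p in toy[0].parameters():
    p.requires_grad = False

total, trainable = count_parameters(toy)
```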

The number of parameters is really high. In total it consists of 2 BERT models plus a linear layer of size 768x2x2,
so that’s still a huge capacity.

I didn’t want to decrease the model’s depth, because I wanted to take advantage of the pretrained parameters (I load both of them from the Hugging Face model hub).

I tried to lower the capacity using weight decay and dropout.

Should I remove layers?

Right. I think for this smaller dataset, you’re not really justified in having such a large model. Larger models may tend to overfit faster, unless you really up the dropout and augmentation to counter that tendency.

Hmm, I performed some runs with smaller networks (fewer transformer blocks, less depth) but we still have no score improvement… :confused:

From the conversation so far, my limited understanding is that your training set is skewed (35% class 1, 65% class 0), your validation set is balanced (50% class 1, 50% class 0), and the “best performance” of the baseline model is 80%. (Is that on the same validation dataset or a different one? What is the validation skew for the baseline model: also 50%/50%, or different?)

If so, your model has already achieved almost the “best performance” possible given the training skew. A train ROC of 99% means your model will predict roughly according to the training skew, which is 35% class 1 and 65% class 0. (That is what training means: you tell the model what the universe looks like, here a 35%/65% skew!) When you then validate on a 50%/50% dataset where the best performance is 80%, the best case for your model is roughly: (35% class 1, 65% class 0) against 50%/50% => 15% false negatives => 85% * 80% val ROC => 68% val ROC.

You reported that you achieved 64% val ROC, which is already very close to 68%.

If you swap the val and train datasets (i.e. train on the 50%/50% set and test on the 35%/65% set), the resulting val ROC might again land near 68%.

The extreme case would be a training skew of 1%/99% or even 0%/100%; then I think you can guess that the val ROC on a 50%/50% set will be very poor…

Again, this is a super simplified analysis that just tries to reveal the “linearity” limitation of current NN technology: NNs are good at interpolation and not good at extrapolation… ; (

So, assuming your network and training have no bugs, I would suggest tuning your training set to roughly a 50%/50% split, then testing again to see whether your val ROC improves…

(Disclaimer: again, the above is not a theorem but a conjecture that deserves further validation ; ))

  1. yes the training data has a 35%/65% imbalance, the validation data is perfectly balanced.
  2. yes the baseline model is (according to the authors) evaluated on the exact same validation data

But effectively I have a less skewed dataset due to oversampling (and I tried undersampling before to get a perfectly balanced training set), and I get more or less the same results in both cases (maybe 1% difference).

:confused:

what’s your optimizer? optimizer parameters?

I tried Adam and AdamW with learning rates ranging from 10**-2 to 10**-6, and varying weight decays.

It turns out that in my case an initial learning rate of 10**-4 seems to work best.

Moreover, I use a learning rate scheduler that lowers the learning rate via lr_i = 0.9*lr_{i-1}, but it doesn’t seem to have an impact on the model performance or the overfitting.
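
To get a feel for that schedule, the recurrence lr_i = 0.9*lr_{i-1} has the closed form lr_i = lr_0 * 0.9**i. The sketch below assumes one decay step per epoch, which the post does not state; the function name is made up.

```python
def decayed_lr(initial_lr, gamma, step):
    """Learning rate after `step` multiplicative decay steps."""
    return initial_lr * gamma ** step


# Assumption: the scheduler steps once per epoch, starting from 10**-4.
lr_after_4 = decayed_lr(1e-4, 0.9, 4)     # about 6.56e-05
lr_after_30 = decayed_lr(1e-4, 0.9, 30)   # about 4.24e-06
```

Under that assumption the rate has only dropped by about a third by epoch 4, and by a factor of roughly 24 by epoch 30. In PyTorch the same schedule corresponds to `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)` stepped once per epoch.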

can you please try it with SGD and see if you get the same overfitting? plus can you remove the scheduler and the learning rate update you have used?

1 - I would drop all learning rate scheduling and see if there is a difference between both optimizers with a few different parameters.
2 - Once you know which one is best in this case, add back the LR scheduler.
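
A comparison like the one suggested in these two steps could be organized with a small factory, so each run differs only in the optimizer and uses a constant learning rate. This is a sketch: `make_optimizer`, the toy model, and the momentum, weight decay and learning rate values are placeholders, not recommendations from this thread.

```python
import torch
from torch import nn


def make_optimizer(name, params, lr):
    """Build an optimizer by name; no scheduler is attached, so lr stays constant."""
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)       # momentum is a placeholder
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr, weight_decay=0.01)  # placeholder decay
    raise ValueError(f"unknown optimizer: {name}")


# Toy model (hypothetical); a real comparison would re-initialize the model for every run.
model = nn.Linear(4, 1)
optimizers = {
    name: make_optimizer(name, model.parameters(), lr=1e-3)
    for name in ("sgd", "adam", "adamw")
}
```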

i will perform some runs with sgd, brb

Also, what are the results when all categories are the same size? It’s a classification problem, right? Try it with every category at the same number of samples, for 4 epochs or so.
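
One simple way to get equally sized categories is random undersampling down to the smallest class. The following is a minimal sketch in plain Python; the function name and the toy labels (mimicking a 35%/65% split) are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subsample(labels, seed=0):
    """Return sorted indices with every class cut down to the size of the smallest class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)           # fixed seed keeps the subset reproducible
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    return sorted(keep)


# Toy labels (hypothetical): 35 samples of class 1 and 65 of class 0.
labels = [1] * 35 + [0] * 65
keep = balanced_subsample(labels)
```

With a PyTorch dataset, the resulting indices could be passed to `torch.utils.data.Subset(dataset, keep)` to train on the balanced subset.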