Still overfitting, no matter how strongly I regularize

It's a binary classification problem and the validation set is perfectly balanced (250 samples of class 0, 250 of class 1).

The training set is a bit skewed: about 35% of its samples are of class 1. I try to combat that using WeightedRandomSampler to achieve implicit oversampling. Beforehand I tried undersampling by just discarding the surplus of the majority class at random. Both approaches give more or less (±1%) the same results.
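For reference, a minimal sketch of that WeightedRandomSampler setup (names like `train_labels` and `train_dataset` are assumptions, not the actual code):

```python
# Implicit oversampling: weight each sample by the inverse frequency of its class,
# so the minority class is drawn more often (with replacement).
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

labels = torch.tensor(train_labels)                   # 0/1 labels of the training set
class_counts = torch.bincount(labels)                 # e.g. [~65% zeros, ~35% ones]
sample_weights = 1.0 / class_counts[labels].float()   # rarer class -> higher weight
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```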

You have a low sample size. I would really up the augmentation and set the dropout to 50%. Use as many transforms as you can before putting the images into the model,
e.g. ColorJitter, RandomAffine, RandomResizedCrop, RandomHorizontalFlip, RandomPerspective, RandomRotation, GaussianBlur, RandomInvert, RandomPosterize, RandomSolarize, etc.
https://pytorch.org/vision/stable/transforms.html
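A rough sketch of what such an aggressive pipeline could look like (the specific transforms and magnitudes here are illustrative assumptions, not recommendations from the docs):

```python
# Illustrative augmentation stack built from torchvision transforms;
# tune the selection and magnitudes for your data.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])
```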

You can't go wrong with too many transforms, especially with little data. Nvidia's research team has found that aggressive transforms produce excellent results for vision models.

One other question: how many parameters does the entire model have? For example:
print(sum(p.numel() for p in model.parameters()))
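If parts of the pretrained encoders are frozen, it can also be useful to report the trainable count separately (a small sketch assuming the same `model` variable):

```python
# Compare total vs. trainable parameters (the two differ when layers are frozen)
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total:,}  trainable: {trainable:,}")
```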


The number of parameters is really high: in total the model consists of 2 BERT models plus a linear layer of size 768*2 x 2.
So that's still a huge capacity.

I didn't want to decrease the model's depth, because I wanted to take advantage of the pretrained parameters (I load both of them from the Hugging Face model hub).

I tried to lower the effective capacity using weight decay and dropout.

Should I remove layers?

Right. I think for this smaller dataset, you’re not really justified in having such a large model. Larger models may tend to overfit faster, unless you really up the dropout and augmentation to counter that tendency.


Hmm, I performed some runs with smaller networks (fewer transformer blocks / less depth), but there is still no score improvement… :confused:
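(For reference, a sketch of one way to load such a shallower pretrained encoder, e.g. by overriding `num_hidden_layers`; the layer count here is arbitrary:)

```python
# Load only the first 6 pretrained transformer blocks of bert-base;
# the remaining layers' weights in the checkpoint are simply not loaded.
from transformers import BertModel

small_bert = BertModel.from_pretrained("bert-base-uncased", num_hidden_layers=6)
print(small_bert.config.num_hidden_layers)  # 6
```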

According to your conversation, my (limited) understanding is: your training dataset is skewed (35% class 1 and 65% class 0), your validation dataset is balanced (50%/50%), and the "best performance" of the baseline model is 80%. (Is that measured on the same validation dataset or a different one? What is the validation skew for the baseline model: also 50%/50%, or something else?)

Then your model has actually already achieved almost the "best performance" given the training dataset skew: a train ROC of 99% means the model's predictions will largely mirror the training skew of 35% class 1 and 65% class 0 (this is what training means: you tell the model what the universe looks like, and here that universe is 35%/65%). When you then validate on the 50%/50% set against an 80% best performance, a rough best-case estimate is: the model can cover only about 85% of the balanced validation distribution (the missing ~15% of class-1 cases become false negatives), so 85% * 80% ≈ 68% val ROC.

You reported that you achieved 64% val ROC, which is already very close to that 68%.

If you swap the val and train datasets (i.e. use the 50%/50% data to train the model and test it on the 35%/65% data), the resulting val ROC might approach a similar value of about 68%.

The extreme case would be a training skew of 1%/99% or even 0%/100%; I think you can guess that the val ROC on 50%/50% data would then be very poor…

Again, this is a super simplified analysis that just tries to reveal the "linear" nature of the current NN technology's limitation: NNs are good at interpolation and not good at extrapolation… ;(

So under the assumption that your network and training have no bugs, I would suggest tuning your training set to be roughly 50%/50%, then testing again to see whether your val ROC improves.

(Disclaimer: again, the above is not a theorem but a conjecture that deserves further validation ;))

  1. Yes, the training data has a 35%/65% imbalance; the validation data is perfectly balanced.
  2. Yes, the baseline model is (according to the authors) evaluated on the exact same validation data.

But effectively I have a less skewed dataset due to the oversampling (and I tried undersampling before to get a perfectly balanced training set), yet I achieve more or less the same results (maybe a 1% difference between the two cases).

:confused:

what’s your optimizer? optimizer parameters?

I tried Adam and AdamW with learning rates ranging from 1e-2 down to 1e-6, and with varying weight decays.

It turns out that in my case an initial learning rate of 1e-4 seems to work best.

Moreover, I use a learning rate scheduler that lowers the learning rate via lr_i = 0.9 * lr_{i-1}, but it doesn't seem to have an impact on model performance or overfitting.
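That lr_i = 0.9 * lr_{i-1} decay corresponds to an exponential schedule; for reference, a minimal sketch of such a setup (the weight decay value and the training-loop names are placeholders):

```python
# AdamW at 1e-4 with a per-epoch multiplicative decay of 0.9 (ExponentialLR);
# `model`, `train_loader`, `train_one_epoch` and `num_epochs` are placeholders.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)  # placeholder training step
    scheduler.step()                                 # lr_i = 0.9 * lr_{i-1}
```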

Can you please try it with SGD and see if you get the same overfitting? Also, can you remove the scheduler and the learning rate update you have been using?

1 - I would lose all learning rate scheduling and see if there is a difference between the two optimizers (e.g. AdamW vs. plain SGD, as in the sketch below) with a few different parameter settings.
2 - Once you know which one works best in this case, add the LR schedule back.
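Something as plain as this would do for the comparison (lr and momentum are just placeholders to tune):

```python
# Plain SGD baseline, no scheduler; the values here are placeholders
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```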

I will perform some runs with SGD, brb.

Also, what are the results when all classes are the same size? It's a classification problem, right? Try a run where all classes have the same number of samples, for 4 epochs or so.

what is the result without any dropout?

Interesting…

May I know what the difference is between your model and the baseline model? What did you modify relative to the baseline?

When using under- or oversampling, the results are more or less the same, with only a very small difference.

Massive overfitting after 1 to 2 epochs

That's the interesting part. I am using the "Facebook Hateful Memes" dataset uploaded on Kaggle. They provide some baseline metrics from multi-/crossmodal models on the challenge website.

My model is a simple bi-encoder (ViT for the images, BERT for the linguistic component) with late fusion via a linear layer applied to the concatenated outputs of both.
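A minimal sketch of such a late-fusion bi-encoder (the model names, dropout value and 2-class head are assumptions, not the actual implementation):

```python
# ViT encodes the image, BERT encodes the text, and a linear head classifies
# the concatenation of the two [CLS] embeddings (late fusion).
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class LateFusionBiEncoder(nn.Module):
    def __init__(self, num_classes=2, dropout=0.5):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = self.vit.config.hidden_size + self.bert.config.hidden_size  # 768 + 768
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(fused_dim, num_classes)  # the 768*2 x 2 head

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]        # ViT [CLS]
        txt = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]   # BERT [CLS]
        fused = torch.cat([img, txt], dim=-1)          # late fusion by concatenation
        return self.classifier(self.dropout(fused))    # 2-class logits
```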

Interestingly, I tried a second custom model on the same dataset. The second model is a combined encoder that takes both the ViT embeddings of the images and the BERT embeddings of the texts as input positions. Over the checkpoints/epochs, this model achieves up to 99.9% AUROC on the training data,

but only AUROC values of about 50% (and sometimes even below 50%?!) on the validation data.

are you referring to this challenge?

I browsed through the article and saw a table showing the current test AUROC numbers of various methods; in summary:

human => 82.65%
multimodal (various configurations) => 64.75% ~ 71.41%

I did not see any multimodal configuration that achieves close to 80% AUROC. Could you point me to the reference you mentioned where the baseline method achieves 80% AUROC?

Thanks.

I am using the challenge winners as the baseline. I don't expect to obtain better results than them, but my goal is to reach at least similar results (~75 AUROC).
They are listed here: some of them use data enrichment, some don't,
so obtaining those results is at least -possible- :confused:

Hm, it doesn't seem to lead to any improvement.
Out of curiosity: what's the reason/intuition behind trying SGD instead?