Still overfitting, no matter how strongly I regularize

It's a binary classification problem and the validation set is perfectly balanced (250 samples of class 0, 250 of class 1).

The training set is a bit skewed: about 35% of its samples are of class 1. I try to combat that using WeightedRandomSampler to achieve implicit oversampling. Beforehand I tried undersampling by just discarding the surplus of the majority class at random. Both approaches give more or less (±1%) the same results.
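For reference, a minimal sketch of that WeightedRandomSampler setup (names like `train_labels` and `train_dataset` are assumptions, not the actual code):

```python
# Implicit oversampling: weight each sample by the inverse frequency of its class,
# so the minority class is drawn more often (with replacement).
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

labels = torch.tensor(train_labels)                   # 0/1 labels of the training set
class_counts = torch.bincount(labels)                 # e.g. [~65% zeros, ~35% ones]
sample_weights = 1.0 / class_counts[labels].float()   # rarer class -> higher weight
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```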

You have a low sample size. I would really up the augmentation and set the dropout to 50%. Use as many transforms as you can before putting the images into the model,
e.g. ColorJitter, RandomAffine, RandomResizedCrop, RandomHorizontalFlip, RandomPerspective, RandomRotation, GaussianBlur, RandomInvert, RandomPosterize, RandomSolarize, etc.
https://pytorch.org/vision/stable/transforms.html
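A rough sketch of what such an aggressive pipeline could look like (the specific transforms and magnitudes here are illustrative assumptions, not recommendations from the docs):

```python
# Illustrative augmentation stack built from torchvision transforms;
# tune the selection and magnitudes for your data.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])
```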

You can't go wrong with too many transforms, especially with little data. Nvidia's research team has found that aggressive transforms produce excellent results for vision models.

One other question: how many parameters does the entire model have? For example:
print(sum(p.numel() for p in model.parameters()))
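If parts of the pretrained encoders are frozen, it can also be useful to report the trainable count separately (a small sketch assuming the same `model` variable):

```python
# Compare total vs. trainable parameters (the two differ when layers are frozen)
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total:,}  trainable: {trainable:,}")
```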


The number of parameters is really high: in total the model consists of 2 BERT models plus a linear layer of size 768*2 x 2.
So that's still a huge capacity.

I didn't want to decrease the model's depth, because I wanted to take advantage of the pretrained parameters (I load both of them from the Hugging Face model hub).

I tried to lower the effective capacity using weight decay and dropout.

Should I remove layers?

Right. I think for this smaller dataset, you’re not really justified in having such a large model. Larger models may tend to overfit faster, unless you really up the dropout and augmentation to counter that tendency.


Hmm, I performed some runs with smaller networks (fewer transformer blocks / less depth), but there is still no score improvement… :confused:
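(For reference, a sketch of one way to load such a shallower pretrained encoder, e.g. by overriding `num_hidden_layers`; the layer count here is arbitrary:)

```python
# Load only the first 6 pretrained transformer blocks of bert-base;
# the remaining layers' weights in the checkpoint are simply not loaded.
from transformers import BertModel

small_bert = BertModel.from_pretrained("bert-base-uncased", num_hidden_layers=6)
print(small_bert.config.num_hidden_layers)  # 6
```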

According to your conversation, my (limited) understanding is: your training dataset is skewed (35% class 1 and 65% class 0), your validation dataset is balanced (50%/50%), and the "best performance" of the baseline model is 80%. (Is that measured on the same validation dataset or a different one? What is the validation skew for the baseline model: also 50%/50%, or something else?)

Then your model has actually already achieved almost the "best performance" given the training dataset skew: a train ROC of 99% means the model's predictions will largely mirror the training skew of 35% class 1 and 65% class 0 (this is what training means: you tell the model what the universe looks like, and here that universe is 35%/65%). When you then validate on the 50%/50% set against an 80% best performance, a rough best-case estimate is: the model can cover only about 85% of the balanced validation distribution (the missing ~15% of class-1 cases become false negatives), so 85% * 80% ≈ 68% val ROC.

You reported that you achieved 64% val ROC, which is already very close to that 68%.

If you swap the val and train datasets (i.e. use the 50%/50% data to train the model and test it on the 35%/65% data), the resulting val ROC might approach a similar value of about 68%.

The extreme case would be a training skew of 1%/99% or even 0%/100%; I think you can guess that the val ROC on 50%/50% data would then be very poor…

Again, this is a super simplified analysis that just tries to reveal the "linear" nature of the current NN technology's limitation: NNs are good at interpolation and not good at extrapolation… ;(

So under the assumption that your network and training have no bugs, I would suggest tuning your training set to be roughly 50%/50%, then testing again to see whether your val ROC improves.

(Disclaimer: again, the above is not a theorem but a conjecture that deserves further validation ;))

  1. Yes, the training data has a 35%/65% imbalance; the validation data is perfectly balanced.
  2. Yes, the baseline model is (according to the authors) evaluated on the exact same validation data.

But effectively I have a less skewed dataset due to the oversampling (and I tried undersampling before to get a perfectly balanced training set), yet I achieve more or less the same results (maybe a 1% difference between the two cases).

:confused:

what’s your optimizer? optimizer parameters?

I tried Adam and AdamW with learning rates ranging from 1e-2 down to 1e-6, and with varying weight decays.

It turns out that in my case an initial learning rate of 1e-4 seems to work best.

Moreover, I use a learning rate scheduler that lowers the learning rate via lr_i = 0.9 * lr_{i-1}, but it doesn't seem to have an impact on model performance or overfitting.
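That lr_i = 0.9 * lr_{i-1} decay corresponds to an exponential schedule; for reference, a minimal sketch of such a setup (the weight decay value and the training-loop names are placeholders):

```python
# AdamW at 1e-4 with a per-epoch multiplicative decay of 0.9 (ExponentialLR);
# `model`, `train_loader`, `train_one_epoch` and `num_epochs` are placeholders.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)  # placeholder training step
    scheduler.step()                                 # lr_i = 0.9 * lr_{i-1}
```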

Can you please try it with SGD and see if you get the same overfitting? Also, can you remove the scheduler and the learning rate update you have been using?

1 - I would lose all learning rate scheduling and see if there is a difference between the two optimizers (e.g. AdamW vs. plain SGD, as in the sketch below) with a few different parameter settings.
2 - Once you know which one works best in this case, add the LR schedule back.
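Something as plain as this would do for the comparison (lr and momentum are just placeholders to tune):

```python
# Plain SGD baseline, no scheduler; the values here are placeholders
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```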

I will perform some runs with SGD, brb.

Also, what are the results when all classes are the same size? It's a classification problem, right? Try a run where all classes have the same number of samples, for 4 epochs or so.

what is the result without any dropout?

Interesting…

May I know what the difference is between your model and the baseline model? What did you modify relative to the baseline?

When using under- or oversampling, the results are more or less the same, with only a very small difference.

Massive overfitting after 1 to 2 epochs

That's the interesting part. I am using the "Facebook Hateful Memes" dataset uploaded on Kaggle. They provide some baseline metrics from multi-/crossmodal models on the challenge website.

My model is a simple bi-encoder (ViT for the images, BERT for the linguistic component) with late fusion via a linear layer applied to the concatenated outputs of both.
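A minimal sketch of such a late-fusion bi-encoder (the model names, dropout value and 2-class head are assumptions, not the actual implementation):

```python
# ViT encodes the image, BERT encodes the text, and a linear head classifies
# the concatenation of the two [CLS] embeddings (late fusion).
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class LateFusionBiEncoder(nn.Module):
    def __init__(self, num_classes=2, dropout=0.5):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = self.vit.config.hidden_size + self.bert.config.hidden_size  # 768 + 768
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(fused_dim, num_classes)  # the 768*2 x 2 head

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]        # ViT [CLS]
        txt = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]   # BERT [CLS]
        fused = torch.cat([img, txt], dim=-1)          # late fusion by concatenation
        return self.classifier(self.dropout(fused))    # 2-class logits
```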

Interestingly, I tried a second custom model on the same dataset. The second model is a combined encoder that takes both the ViT embeddings of the images and the BERT embeddings of the texts as input positions. Over the checkpoints/epochs, this model achieves up to 99.9% AUROC on the training data,

but only AUROC values of about 50% (and sometimes even below 50%?!) on the validation data.

are you referring to this challenge?

I browsed through the article and saw a table showing the current test AUROC numbers of various methods; in summary:

human => 82.65%
multimodal (various configurations) => 64.75% ~ 71.41%

I did not see any multimodal configuration that achieves close to 80% AUROC. Could you point me to the reference you mentioned where the baseline method achieves 80% AUROC?

Thanks.

I am using the challenge winners as the baseline. I don't expect to obtain better results than them, but my goal is to reach at least similar results (~75 AUROC).
They are listed here: some of them use data enrichment, some don't,
so obtaining those results is at least -possible- :confused:

Hm, it doesn't seem to lead to any improvement.
Out of curiosity: what's the reason/intuition behind trying SGD instead?