what is the result without any dropout?
May I know the difference between your model and the baseline model? What did you modify from the baseline?
When using under- or oversampling, the results are more or less the same, with only a very small difference
Massive overfitting after 1 to 2 epochs
That’s the interesting part. I am using the “Facebook Hateful Memes dataset” uploaded to Kaggle. They provided some baseline metrics from multi/crossmodal models on the challenge website.
My model is a simple bi-encoder (ViT for the images, BERT for the linguistic component) with a late fusion via a linear layer applied to the concatenated outputs of both.
Interestingly, I tried a second custom model on the same dataset. The second model is a combined encoder that takes both the ViT embeddings of the images and the BERT embeddings of the texts as input positions. Over the checkpoints/epochs, the model achieves up to 99.9% AUROC on train
But it only reaches AUROC values of about 50% (and sometimes even under 50%?!) on the validation data.
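For readers following along, the late-fusion head described above could look roughly like the sketch below. This is not the poster's actual code: the class name `LateFusionClassifier`, the embedding size of 768, the dropout rate, and the random tensors standing in for the ViT and BERT outputs are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Hypothetical late-fusion head: concatenate two embeddings, apply one linear layer."""

    def __init__(self, image_dim=768, text_dim=768, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Single linear layer on the concatenated image and text embeddings.
        self.classifier = nn.Linear(image_dim + text_dim, 1)

    def forward(self, image_emb, text_emb):
        fused = torch.cat([image_emb, text_emb], dim=-1)
        # One logit per example (hateful vs. not hateful).
        return self.classifier(self.dropout(fused)).squeeze(-1)


# Random placeholders for the pooled ViT and BERT outputs (assumed shape: batch x 768).
image_emb = torch.randn(4, 768)
text_emb = torch.randn(4, 768)
targets = torch.tensor([0.0, 1.0, 0.0, 1.0])

model = LateFusionClassifier()
logits = model(image_emb, text_emb)
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
```

In a real setup the two placeholder tensors would come from the image and text encoders. Whether those encoders are frozen or fine-tuned is not stated in the thread.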
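On the "under 50%" surprise: AUROC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 50% is chance level and values below it mean the ranking is anti-correlated with the labels. A model that carries no real signal on validation data will fluctuate around 50%, sometimes landing below it. A minimal pure-Python sketch with made-up labels and scores (not data from the thread):

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.2, 0.4, 0.6, 0.8, 0.5, 0.3]

below_chance = auroc(labels, scores)
# Inverting the scores mirrors the AUROC around 0.5 (when there are no ties).
flipped = auroc(labels, [1.0 - s for s in scores])
```

Here `below_chance` is under 0.5 and `flipped` is its mirror image above 0.5, which shows that a sub-50% value is the same ranking quality as its mirror, just with the sign reversed.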
are you referring to this challenge?
I browsed through the article and saw a table showing the current test AUROC numbers of the various methods. In summary:
human => 82.65%
multimodal (with various configurations) => 64.75% ~ 71.41%
I did not see any multimodal configuration that achieves close to 80% AUROC. Could you point me to the reference you mentioned where the baseline method achieved 80% AUROC?
I am using the challenge winners as baseline. I don’t expect to obtain better results than them, but my goal is to have at least similar results (~75% AUROC).
They are listed here: some of them use data enrichment, some do not
so obtaining the results is at least -possible-
hm it doesn’t seem to lead to any improvement.
out of curiosity: what’s the reason/intuition behind trying SGD instead?
“My model is a simple bi-encoder (vit for the images, bert for the linguistic component) with a late fusion” => if we check the table, the “late fusion” performance is 64.75%. If your model falls into the same category as this entry, then your 64% is already close to the limit of the expected performance.
challenge winners baseline 84.5% => interestingly, this performance is still bounded by < 85%, as predicted in the previous post, due to training set bias. The reason the winner achieved a higher performance than the “late fusion” model is “feature extraction, web entity detection, human race detection. Those tags will give the transformer models much more diverse information”. This is equivalent to saying “extend the feature vector space of both the training and validation datasets to more linearly independent dimensions, so that a larger percentage of the validation feature vectors becomes linearly separable in the extended feature vector space”
hopefully this makes sense ; )
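A toy illustration of the "extended feature space" argument, far simpler than the hateful memes setup: XOR labels cannot be separated by a linear classifier on the two raw inputs, but they can once a third feature (here the product of the two inputs) is added. The perceptron, the data, and the extra feature are all chosen for illustration and are not taken from the challenge winners' pipeline.

```python
def predict(w, b, x):
    """Linear threshold unit: class 1 if w.x + b > 0, else class 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def train_perceptron(X, y, epochs=1000):
    """Classic perceptron rule; stops early after an epoch without mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            step = yi - predict(w, b, xi)  # -1, 0 or +1
            if step != 0:
                mistakes += 1
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
        if mistakes == 0:
            break
    return w, b


def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


# XOR: not linearly separable in the original two dimensions.
X_raw = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

# Same points with one extra feature x1 * x2 (the "enrichment" in this toy case).
X_ext = [(x1, x2, x1 * x2) for x1, x2 in X_raw]

acc_raw = accuracy(*train_perceptron(X_raw, y), X_raw, y)
acc_ext = accuracy(*train_perceptron(X_ext, y), X_ext, y)
```

On the raw inputs the perceptron can never classify all four points correctly, while on the extended inputs it converges to a perfect separator. Whether the extra tags in the winning solutions help for exactly this reason is the previous poster's interpretation, not something this sketch proves.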
I will try to extract some further features and come back to report my findings =)
thank you all for the vivid discussion and the advice!
I really enjoyed this