I would like to understand which of the 2 options is a better approach to deal with an imbalanced dataset.
I’m dealing with a dataset that has:
- 516 items in class 0
- 91 items in class 1
- 622 items in class 2
- 592 items in class 3
My data has far fewer items of class 1 than of the other classes.
Searching a little, I found that the usual approaches to this kind of problem are:
- torch.utils.data.WeightedRandomSampler, where I sample each item with a probability given by its weight
- torch.nn.CrossEntropyLoss(weight=...) (my problem is classification), where I weight each class's contribution to the loss used for the update
I couldn’t find an answer as to which of them is the better approach in which cases, and I would be very thankful if someone could explain it to me, even if it doesn’t actually matter which one I choose.
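For reference, the two options can be set up roughly like this (a minimal sketch using the class counts from the question; the `labels` tensor is a stand-in I built from those counts, and batch size and variable names are my own choices):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Class counts from the question (classes 0..3).
class_counts = torch.tensor([516.0, 91.0, 622.0, 592.0])

# Stand-in for the dataset's target tensor, built from the counts
# just to make the sketch self-contained.
labels = torch.cat(
    [torch.full((int(n),), i) for i, n in enumerate(class_counts)]
)

# Option 1: oversample rare classes so each class is drawn equally often.
# Each sample's weight is the inverse of its class frequency.
sample_weights = 1.0 / class_counts[labels]  # one weight per sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))
# loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Option 2: keep sampling uniform, but reweight the loss per class.
# Note the argument is `weight` (singular), not `weights`.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```

With these counts, class 1 gets the largest weight in both schemes; the exact normalization of the weights is a matter of convention.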
My own personal theory, for which I have absolutely no evidence, is this: if WeightedRandomSampler is likely to give you a batch with duplicate samples from your underrepresented class (the class with fewer samples), you’re just wasting computation by running the duplicates through your model, so you might as well use class weights in your loss function.
This is probably also true if you are likely to get duplicates over the course of a couple of batches during which the weights in your model haven’t changed much.
But if you aren’t likely to have duplicates, the gradient you calculate will be “more representative” if your loss contains, say, five different samples from your underrepresented class than if it contains only one such sample, weighted by a factor of five.
Where you cross over from preferring loss-function weights to preferring WeightedRandomSampler, I don’t really know.
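To put rough numbers on the duplicate argument: with the question’s 91 class-1 samples, a sampler that balances the four classes, and a batch size of 64 (my assumption), you’d expect about 16 class-1 draws per batch, and a birthday-problem calculation shows a duplicate within a single batch is already more likely than not:

```python
# Back-of-envelope check of the duplicate argument, using the question's
# numbers. Batch size 64 is an assumption, not from the original post.
n_class1 = 91
draws_per_batch = 64 // 4  # ~16 class-1 draws per batch under balanced sampling

# Probability that all class-1 draws in one batch are distinct
# (sampling with replacement, birthday-problem style).
p_all_distinct = 1.0
for k in range(draws_per_batch):
    p_all_distinct *= (n_class1 - k) / n_class1

p_duplicate = 1.0 - p_all_distinct
# With these numbers p_duplicate comes out well above 0.5,
# i.e. duplicates within a batch are the norm, not the exception.
```

So for counts this skewed, the answer’s “wasted computation” scenario applies essentially every batch, which is a point in favor of loss-function weights here.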