Data imbalance handling

How can I handle data imbalance by adjusting the loss function or by biasing the batch sampling?

It would be great if some code excerpts could be provided. :slight_smile:

You could pass a weight tensor containing your class weights to your loss function (e.g. NLLLoss). It should have length C, i.e. one weight per class.
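
For example (the number of classes and the weight values here are just placeholders):

import torch
import torch.nn as nn

# e.g. 3 classes, where the last class is underrepresented and gets a larger weight
class_weights = torch.tensor([1.0, 1.0, 5.0])
criterion = nn.NLLLoss(weight=class_weights)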


I believe that if you use the weight tensor, then shuffle=True is not an option, at least as far as I saw a few days ago, though I would be happy to be corrected.


The weight argument of a loss function has no impact on the shuffle option of the DataLoader.
Could it be you are referring to the sampler argument of the DataLoader?
The docs state that shuffle cannot be set when a sampler is given.

To sample in a balanced way, you could use a WeightedRandomSampler with the corresponding class weights.
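
A rough sketch of how this could be set up (assuming the targets are available as a 1-D tensor of class indices; the dataset and batch size are placeholders):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

class_counts = torch.bincount(targets)          # targets: 1-D tensor of class indices
class_weights = 1.0 / class_counts.float()      # rarer classes get larger weights
sample_weights = class_weights[targets]         # one weight per sample

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)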

However, a weighted loss should work regardless of the sampling method.

Absolutely, you are correct, and my apologies for the presumptuous reading and false statement.
After spending time looking into the Sampler only to find it could not be used in my setup, I was hoping to save someone else that time.

If balancing the data through the loss function works as well and allows shuffling, why would one use the Sampler rather than this method, short of rolling one’s own?

No worries, sometimes it’s quite easy to mix something up. Especially when the arguments have the same name. :wink:

Well, the weights in the loss function perform a class weighting, i.e. each sample’s loss gets multiplied by the weight of its class. Have a look at the formula in the docs.
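
For instance, with the default reduction='mean' the weighted loss can be reproduced by hand like this (the numbers are made up):

import torch
import torch.nn as nn

log_probs = torch.log_softmax(torch.tensor([[2.0, 0.5], [0.1, 1.5]]), dim=1)
targets = torch.tensor([0, 1])
w = torch.tensor([1.0, 3.0])          # class weights

loss = nn.NLLLoss(weight=w)(log_probs, targets)
# the 'mean' reduction is a weighted average over the samples
manual = -(w[targets] * log_probs[torch.arange(2), targets]).sum() / w[targets].sum()
print(torch.allclose(loss, manual))   # True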

The WeightedRandomSampler draws samples according to the weights, so that some classes are sampled more often than others.

In the case of an imbalanced setup, you could try either of these approaches.

@ptrblck @smth
I found that calculating the weights for this WeightedRandomSampler(weights, sample_size) is time-consuming.

As people have said, len(weights) == len(dataset), so if the dataset is huge (e.g. 10^9 samples), this part is inefficient:

import torch

# class_prob[c] holds the probability (relative frequency) of class c
reciprocal_weights = []
for index in range(len(dataset)):
    reciprocal_weights.append(class_prob[dataset[index][1]])
weights = 1.0 / torch.tensor(reciprocal_weights)

Is it really necessary to calculate all 10^9 weights if my sample_size is very small (e.g. 1000)?
I don’t think so, especially since the weights array is usually a vector with lots of redundancy, which could be expressed as a sparse vector.

The for loop might indeed be slow.
Could you try to use direct indexing like in this example?
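
Something along these lines might work (this assumes the labels are accessible as a tensor without iterating the dataset, e.g. via a targets attribute as in the torchvision datasets; that attribute name is an assumption about your dataset):

import torch

# build the per-sample weights with direct indexing instead of a Python loop
targets = torch.as_tensor(dataset.targets)
class_prob = torch.bincount(targets).float() / len(targets)
weights = 1.0 / class_prob[targets]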

Currently the weights are passed to torch.multinomial, so I guess there is no easy alternative at the moment. I get your issue and like the idea of just providing class weights. Let me think about a good approach.