How does the multinomial sampling work?

JiaMingLin · January 6, 2023, 3:54am

Hi,
I am dealing with an imbalanced dataset, say #(positive) = 1K and #(negative) = 50K.
I found a library, imbalanced-dataset-sampler (which is based on torch.multinomial), to resample the data and reduce the skew.

In its basic usage, it passes an array of per-sample weights to torch.multinomial, which then returns the sampled indices (with replacement).

Example

import torch

num_pos, num_neg = 1_000, 50_000
# weight for each data point: 2e-5 = 1/#(negative), 1e-3 = 1/#(positive)
weights = torch.tensor([2e-5, 1e-3, 2e-5, ...])  # one entry per data point
sampled_indices = torch.multinomial(
    weights,
    num_samples=num_pos + num_neg,
    replacement=True,
)

However, when I query the original dataframe with these sampled indices, #(positive) and #(negative) are almost equal. What is the reason behind this? Can someone give me some tips?
Moreover, I would like to adjust the positive-to-negative ratio. How do I implement that with torch.multinomial?

Thanks.

ptrblck · January 6, 2023, 6:49am

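This would be the desired outcome, as the sampler tries to balance the class indices in the batch.
The reason is that samples with a higher weight are more likely to be drawn than samples with a lower weight. Since you are defining each weight as the inverse of the class frequency, the minority class samples will be oversampled. With your numbers, the weights of each class sum to 1 (1K × 1e-3 for positives, 50K × 2e-5 for negatives), so each draw picks a positive with probability of about 0.5.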
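
To make the above concrete, and as one possible way to change the positive-to-negative ratio, here is a minimal sketch. It assumes binary labels (1 = positive, 0 = negative) and synthetic data with the counts from the question. The helper names `make_weights` and `sample_pos_share` and the `pos_fraction` parameter are made up for this illustration and are not part of imbalanced-dataset-sampler or PyTorch. The idea is to scale the per-class weights so that the positive weights sum to `pos_fraction` and the negative weights sum to `1 - pos_fraction`.

```python
import torch

torch.manual_seed(0)

# Synthetic labels with the counts from the question (1 = positive, 0 = negative).
num_pos, num_neg = 1_000, 50_000
labels = torch.cat([
    torch.ones(num_pos, dtype=torch.long),
    torch.zeros(num_neg, dtype=torch.long),
])


def make_weights(labels, pos_fraction=0.5):
    """Hypothetical helper: per-sample weights whose positive entries sum to
    pos_fraction and whose negative entries sum to 1 - pos_fraction."""
    is_pos = labels == 1
    n_pos = int(is_pos.sum())
    n_neg = labels.numel() - n_pos
    weights = torch.empty(labels.numel(), dtype=torch.double)
    weights[is_pos] = pos_fraction / n_pos
    weights[~is_pos] = (1.0 - pos_fraction) / n_neg
    return weights


def sample_pos_share(labels, pos_fraction):
    """Draw len(labels) indices with replacement and return the share of positives."""
    weights = make_weights(labels, pos_fraction)
    sampled_indices = torch.multinomial(
        weights, num_samples=labels.numel(), replacement=True
    )
    return (labels[sampled_indices] == 1).double().mean().item()


# pos_fraction=0.5 reproduces the 1/#(class) weighting from the question.
balanced_share = sample_pos_share(labels, pos_fraction=0.5)
# pos_fraction=0.2 aims for roughly 1 positive per 4 negatives.
skewed_share = sample_pos_share(labels, pos_fraction=0.2)

print(f"positives with pos_fraction=0.5: {balanced_share:.3f}")
print(f"positives with pos_fraction=0.2: {skewed_share:.3f}")
```

torch.multinomial does not require the weights to sum to 1, only their relative sizes matter, so the normalization here is just for readability. The observed shares are random and will only be close to the requested fraction.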