Binary classification with imbalanced dataset


I’m working on a binary text classification problem. I have a pretty imbalanced dataset. The statistics are shown below.

train: positive 9598, negative 30988
val:   positive 1200, negative 3874
test:  positive 1200, negative 3874

When I searched online and on the forums, I came across a couple of methods to deal with this, and I have a few questions about them. For this problem I am using BCEWithLogitsLoss, which has both a weight parameter and a pos_weight parameter.

  1. What is the difference between weight and pos_weight, and which one should I use for this problem?
  2. For pos_weight, the example says to pass the ratio between the sizes of the negative and positive classes. How will the function know which class is the positive one when I pass this weight?
  3. What do I pass for the weight parameter?
  4. Would using PyTorch’s WeightedRandomSampler help in this case? I don’t understand the example given for it either; what do I pass for its parameters?
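For context, here is how I would compute that ratio from my training split, assuming pos_weight really is the number of negatives divided by the number of positives (the counts are from my dataset above; the rest is just the standard BCEWithLogitsLoss API):

```python
import torch
import torch.nn as nn

# Class counts from my training split
num_pos = 9598
num_neg = 30988

# If pos_weight is (# negatives) / (# positives), this is ~3.23,
# i.e. each positive example would count ~3.23x in the loss
pos_weight = torch.tensor([num_neg / num_pos])

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                      # raw model outputs (no sigmoid)
targets = torch.randint(0, 2, (8, 1)).float()   # 0/1 labels
loss = criterion(logits, targets)
```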

Please let me know if additional information is required and thanks for the help.


Hi, if you read the docs, weight just reweights batch-wise, and in general you have no knowledge of what your batch contains. I would go with pos_weight in order to balance the contribution of each class.

pos_weight is applied when the positive class is the ground truth (I think; I didn’t look at the source code).
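A quick sanity check of that behavior (just a sketch, not taken from the source): on a positive target the weighted loss comes out to exactly pos_weight times the unweighted one, while a negative target is unaffected:

```python
import torch
import torch.nn as nn

plain = nn.BCEWithLogitsLoss()
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))

logit = torch.tensor([0.5])

# Positive target: the loss term is multiplied by pos_weight
pos = torch.tensor([1.0])
assert torch.isclose(weighted(logit, pos), 3.0 * plain(logit, pos))

# Negative target: pos_weight has no effect
neg = torch.tensor([0.0])
assert torch.isclose(weighted(logit, neg), plain(logit, neg))
```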

Sampling with probabilities assigned to each class is another, different way of addressing this problem. However, you don’t have to use both.
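If you do go the sampling route, note that WeightedRandomSampler expects one weight per sample, not per class. A common recipe (a sketch with toy data standing in for your dataset) is to give each sample the inverse frequency of its class, so both classes are drawn roughly equally often:

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy labels mimicking the imbalance (0 = negative, 1 = positive)
labels = torch.tensor([0] * 30988 + [1] * 9598)

# One weight PER SAMPLE: each sample gets 1 / (count of its class)
class_counts = torch.bincount(labels)               # tensor([30988, 9598])
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),   # draw one "epoch" worth of samples
    replacement=True,          # needed to oversample the minority class
)

dataset = TensorDataset(torch.randn(len(labels), 4), labels)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```

With these weights, batches from the loader should contain positives and negatives in roughly equal proportion, even though positives are only about 24% of the underlying data.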