How to handle an imbalanced dataset in text classification

smi222 · September 29, 2020, 12:46pm

Hello Everyone,

I am working on a binary text classification problem. How do I apply SMOTE or WeightedRandomSampler to handle the imbalance in my dataset? My code currently looks like this:

import torch
from torch.utils.data import Dataset

class GDataset(Dataset):

  def __init__(self, passage, targets, tokenizer, max_len, transform=None):
    self.passage = passage
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len
    self.transform = transform  # optional text augmentation for the minority class
  
  def __len__(self):
    return len(self.passage)
  
  def __getitem__(self, item):
    passage = str(self.passage[item])
    target = self.targets[item]
    
    # Augment only minority-class samples (label 1), and only if a transform was given
    if (target == 1) and self.transform:
      passage = self.transform(passage)

    encoding = self.tokenizer.encode_plus(
      passage,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      padding='max_length',  # replaces the deprecated pad_to_max_length=True
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )
    return {
      'passage_text': passage,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long),
    }

How can I apply other balancing techniques?

ptrblck · September 30, 2020, 9:14am

If each sample contains a single target value, you should be able to directly use the WeightedRandomSampler as described in this example.
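
For reference, here is a minimal sketch of how that could look. It is not the linked example itself. The toy `targets` tensor (90 negatives, 10 positives) and the `TensorDataset` are placeholders I made up for illustration; in your case you would pass your own `GDataset` instance and build `sample_weights` from its labels. Weighting each sample by the inverse of its class frequency is one common choice, not the only one.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Assumed toy labels standing in for the real targets: 90 negatives, 10 positives.
targets = torch.tensor([0] * 90 + [1] * 10)

# Count the samples in each class, then weight every sample by the
# inverse of its class count so both classes carry equal total weight.
class_counts = torch.bincount(targets)
class_weights = 1.0 / class_counts.float()
sample_weights = class_weights[targets]  # one weight per sample

# Draw len(dataset) indices per epoch, with replacement, according to the weights.
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)

# Placeholder dataset; swap in your own Dataset here.
features = torch.arange(len(targets)).unsqueeze(1)
dataset = TensorDataset(features, targets)

# Pass the sampler instead of shuffle=True (the two are mutually exclusive).
loader = DataLoader(dataset, batch_size=20, sampler=sampler)
```

With replacement enabled, minority samples are drawn repeatedly, so batches should come out roughly balanced on average, although any individual batch can still be skewed.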