I am building a BERT binary classifier with an imbalanced dataset (roughly 1 positive to 9 negatives).
I read elsewhere that I have two options:
- Calculate a weighted loss using nn.BCEWithLogitsLoss(pos_weight=n_negative/n_positive). I tried pos_weight values from 9 down to 2.
- Oversample the minority class (class 1 in my case).
First, to confirm: we don't usually use methods 1 and 2 together, right? We pick only one of them.
My main evaluation metric is precision, but I also don't want recall to be too low. Without either method, my validation precision is around 54% while recall can be below 10%. After trying method 1, recall went up to 30%, but precision also dropped to 30%.
So I am now trying the second method.
Here's my question:
Do you use the random sampler for the train set only, or for the train, validation, and test sets?
Currently my loss function lives in model.py. If I use method 2, then in theory I should switch to the unweighted loss. However, if I only apply the WeightedRandomSampler to the training set, my validation and test sets will still have the 1:9 ratio while using the unweighted loss calculation.
If I apply the WeightedRandomSampler to all three sets, I think the results will be biased, since future data to predict will likely still have the 1:9 ratio, not the oversampled ratio.
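For context, the setup I was imagining is roughly this: oversample only the training loader and leave validation/test at their natural ratio. This is a minimal sketch with toy tensors standing in for my real datasets (all names and shapes here are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-ins for the real datasets: 1:9 imbalance in train
train_labels = torch.tensor([0] * 90 + [1] * 10)
train_dataset = TensorDataset(torch.randn(100, 8), train_labels.float())
valid_dataset = TensorDataset(torch.randn(20, 8), torch.randint(0, 2, (20,)).float())

# Per-sample weight = inverse frequency of that sample's class,
# so both classes are drawn roughly equally often during training
class_counts = torch.bincount(train_labels)
class_weights = 1.0 / class_counts.float()
sample_weights = class_weights[train_labels]

train_sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(sample_weights), replacement=True
)

# Oversample ONLY the training loader; keep valid/test at the natural 1:9 ratio
train_loader = DataLoader(train_dataset, batch_size=16, sampler=train_sampler)
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False)
```

With this setup, each training epoch still draws len(train_dataset) samples, but roughly half of them are positives (drawn with replacement), while validation batches keep the real-world class ratio.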
Can you help me understand how I should oversample my data? Thank you so much!
Here's how I calculate the loss in model.py:
```python
review_outputs = self.bert(
    review_input_ids,
    attention_mask=review_attention_mask,
    token_type_ids=review_token_type_ids,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
)
feature = review_outputs[1]  # pooled output (index 1 of the BertModel output tuple)
logits = self.classifier(feature).squeeze(-1)  # squeeze only the last dim, so batch size 1 is safe
outputs = (logits,)  # + outputs[2:]  # add hidden states and attentions if they are here

if labels is not None:
    pos_weight = torch.tensor(9.0)
    loss_fct = nn.BCEWithLogitsLoss().cuda()  # pass pos_weight=pos_weight here when using the weighted loss
    loss = loss_fct(logits, labels)
    outputs = (loss,) + outputs
```
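To double-check what pos_weight actually does, I compared the two criteria in isolation on a single positive example (standalone toy values, not my real model):

```python
import torch
import torch.nn as nn

logits = torch.tensor([0.0])      # sigmoid(0) = 0.5
pos_label = torch.tensor([1.0])

plain = nn.BCEWithLogitsLoss()
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(9.0))

# pos_weight multiplies the positive-class term of the loss, so the loss on
# a positive example is 9x larger under the weighted criterion
print(plain(logits, pos_label).item())     # ≈ 0.6931 (= ln 2)
print(weighted(logits, pos_label).item())  # ≈ 6.2383 (= 9 * ln 2)
```

So method 1 is equivalent to counting every positive example 9 times in the loss, which is why I assumed it should not be combined with oversampling.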
And here is how I build the sampler (only the train one is listed, since the valid and test ones are similar):
```python
target = train_dataset[:]['label']

# np.unique returns (values, counts); we only need the counts
_, class_sample_count = np.unique(target, return_counts=True)

# Per-sample weight = inverse frequency of that sample's class
weight = 1. / class_sample_count
samples_weight = weight[target]
samples_weight = torch.from_numpy(samples_weight)

train_sampler = torch.utils.data.WeightedRandomSampler(samples_weight, len(samples_weight))

train_data_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    sampler=train_sampler,
    num_workers=args.num_workers,
    drop_last=True,
)
```
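As a sanity check, I verified the same weighting logic on a toy array with the 1:9 ratio and looked at the class mix of one batch (the data here is made up just for the check):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy target with the same 1:9 imbalance
target = np.array([0] * 900 + [1] * 100)
dataset = TensorDataset(torch.randn(1000, 4), torch.from_numpy(target))

_, class_sample_count = np.unique(target, return_counts=True)
weight = 1. / class_sample_count
samples_weight = torch.from_numpy(weight[target])

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

# Each batch should now contain roughly 50% positives instead of 10%
_, labels = next(iter(loader))
print(labels.float().mean().item())  # roughly 0.5 on average
```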