Hello there,
I am building a BERT binary classifier on an imbalanced dataset (a 1:9 ratio of Y to N labels).
I read elsewhere that I have two options:
- Calculate a weighted loss using nn.BCEWithLogitsLoss(pos_weight=n_negative/n_positive); I tried pos_weight values from 9 down to 2 (see the sketch just after this list).
- Oversample the minority class (class 1 in my case).
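For my own reference, here is a minimal sketch of option 1 (the 9.0 comes straight from my 1:9 class ratio; n_negative and n_positive would be counts taken from my train labels):

import torch
import torch.nn as nn

# Option 1: weight the positive class by the negative/positive count ratio
pos_weight = torch.tensor([n_negative / n_positive])  # e.g. tensor([9.0]) for my 1:9 split
loss_fct = nn.BCEWithLogitsLoss(pos_weight=pos_weight)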
First, to confirm: we don’t usually use methods 1 and 2 together, right? We pick only one of them.
My evaluation metric is precision, but I also don’t want recall to be too low. Previously, without either method, my validation precision was around 54% while recall could be below 10%. After trying method 1, recall went up to 30%, but precision dropped to 30% as well.
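For context, I compute these with sklearn.metrics on the collected validation outputs; a minimal sketch, where val_logits and val_labels are my own names and the 0.5 cutoff is an assumption:

import torch
from sklearn.metrics import precision_score, recall_score

probs = torch.sigmoid(val_logits)            # logits -> probabilities
preds = (probs > 0.5).long().cpu().numpy()   # 0.5 cutoff; moving it also trades precision for recall
print(precision_score(val_labels, preds), recall_score(val_labels, preds))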
So I am now trying the second method.
Here are my questions:
Do you use the random sampler for the train set only, or for the train, valid, and test sets?
Currently my loss function lives in my model.py, so if I use method 2, theoretically I should switch to an unweighted loss. However, in that case, if I only use the WeightedRandomSampler on the train set, my valid and test sets will still have the 1:9 ratio while being scored with the unweighted loss.
If I use the WeightedRandomSampler on all three sets, then I think the results will be biased, since future data to predict will likely still have the 1:9 ratio rather than the oversampled ratio.
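To make that concrete, the variant I am considering looks like this (a sketch; only the train loader gets the sampler built further below, and valid_dataset/test_dataset are my own names):

from torch.utils.data import DataLoader

# Oversample only during training; evaluate on the untouched 1:9 distribution
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, drop_last=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)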
Can you help me understand how I should oversample my data? Thank you so much!
Here is how I calculate the loss in model.py:
review_outputs = self.bert(
    review_input_ids,
    attention_mask=review_attention_mask,
    token_type_ids=review_token_type_ids,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
)
feature = review_outputs[1]                     # pooled [CLS] output
logits = self.classifier(feature).squeeze(-1)   # squeeze only the last dim so batch size 1 still works
outputs = (logits,)  # + outputs[2:]  # add hidden states and attentions if they are here
if labels is not None:
    # pos_weight = torch.tensor(9.0)  # pass pos_weight=pos_weight below to switch back to the weighted loss (method 1)
    loss_fct = nn.BCEWithLogitsLoss()
    loss = loss_fct(logits, labels.float())     # BCEWithLogitsLoss needs float targets
    outputs = (loss,) + outputs
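For completeness, if I switch back to method 1, I believe that block becomes the following (a sketch; the pos_weight tensor has to live on the same device as the logits):

if labels is not None:
    pos_weight = torch.tensor(9.0, device=logits.device)  # n_negative / n_positive
    loss_fct = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    loss = loss_fct(logits, labels.float())
    outputs = (loss,) + outputs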
And here is how I build the sampler (only showing the train one, since the valid and test ones would be similar):
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

target = np.asarray(train_dataset[:]['label'])      # per-example labels as a numpy array
class_sample_count = np.unique(target, return_counts=True)[1]
class_sample_count[1] = class_sample_count[1] * 2   # damp the oversampling: halves class 1's weight
weight = 1. / class_sample_count                    # per-class sampling weight
samples_weight = weight[target]                     # per-example weight
samples_weight = torch.from_numpy(samples_weight)
train_sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler,
                               num_workers=args.num_workers, drop_last=True)
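As a sanity check on the sampler, I count how often each class is actually drawn (with the doubled class-1 count above I would expect roughly 2:1 negatives to positives per epoch, not 1:1):

from collections import Counter

sampled = Counter(int(target[i]) for i in train_sampler)  # one full pass over the sampler's indices
print(sampled)  # expect class 0 about twice as often as class 1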