Hello there,
I am building a BERT binary classifier on an imbalanced dataset (a 1:9 ratio of Y to N labels).
I read elsewhere that I have two options:
- Calculate a weighted loss using nn.BCEWithLogitsLoss(pos_weight=n_negative/n_positive); I tried pos_weight values from 9 down to 2 (see the sketch just after this list).
- Oversample the minority class (class 1 in my case).
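For my own reference, here is a minimal sketch of option 1 (the 9.0 comes straight from my 1:9 class ratio; n_negative and n_positive would be counts taken from my train labels):

import torch
import torch.nn as nn

# Option 1: weight the positive class by the negative/positive count ratio
pos_weight = torch.tensor([n_negative / n_positive])  # e.g. tensor([9.0]) for my 1:9 split
loss_fct = nn.BCEWithLogitsLoss(pos_weight=pos_weight)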
First, to confirm: we don’t usually use methods 1 and 2 together, right? We pick only one of them.
My evaluation metric is precision, but I also don’t want recall to be too low. Previously, without either method, my validation precision was around 54% while recall could be below 10%. After trying method 1, recall went up to 30%, but precision dropped to 30% as well.
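For context, I compute these with sklearn.metrics on the collected validation outputs; a minimal sketch, where val_logits and val_labels are my own names and the 0.5 cutoff is an assumption:

import torch
from sklearn.metrics import precision_score, recall_score

probs = torch.sigmoid(val_logits)            # logits -> probabilities
preds = (probs > 0.5).long().cpu().numpy()   # 0.5 cutoff; moving it also trades precision for recall
print(precision_score(val_labels, preds), recall_score(val_labels, preds))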
So I am now trying the second method.
Here are my questions:
Do you use the random sampler for the train set only, or for the train, valid, and test sets?
Currently my loss function lives in my model.py, so if I use method 2, theoretically I should switch to an unweighted loss. However, in that case, if I only use the WeightedRandomSampler on the train set, my valid and test sets will still have the 1:9 ratio while being scored with the unweighted loss.
If I use the WeightedRandomSampler on all three sets, then I think the results will be biased, since future data to predict will likely still have the 1:9 ratio rather than the oversampled ratio.
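To make that concrete, the variant I am considering looks like this (a sketch; only the train loader gets the sampler built further below, and valid_dataset/test_dataset are my own names):

from torch.utils.data import DataLoader

# Oversample only during training; evaluate on the untouched 1:9 distribution
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, drop_last=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)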
Can you help me understand how I should oversample my data? Thank you so much!
Here is how I calculate the loss in model.py:
review_outputs = self.bert(
    review_input_ids,
    attention_mask=review_attention_mask,
    token_type_ids=review_token_type_ids,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
)
feature = review_outputs[1]                     # pooled [CLS] output
logits = self.classifier(feature).squeeze(-1)   # squeeze only the last dim so batch size 1 still works
outputs = (logits,)  # + outputs[2:]  # add hidden states and attentions if they are here
if labels is not None:
    # pos_weight = torch.tensor(9.0)  # pass pos_weight=pos_weight below to switch back to the weighted loss (method 1)
    loss_fct = nn.BCEWithLogitsLoss()
    loss = loss_fct(logits, labels.float())     # BCEWithLogitsLoss needs float targets
    outputs = (loss,) + outputs
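For completeness, if I switch back to method 1, I believe that block becomes the following (a sketch; the pos_weight tensor has to live on the same device as the logits):

if labels is not None:
    pos_weight = torch.tensor(9.0, device=logits.device)  # n_negative / n_positive
    loss_fct = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    loss = loss_fct(logits, labels.float())
    outputs = (loss,) + outputs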
And here is how I build the sampler (only showing the train one, since the valid and test ones would be similar):
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

target = np.asarray(train_dataset[:]['label'])      # per-example labels as a numpy array
class_sample_count = np.unique(target, return_counts=True)[1]
class_sample_count[1] = class_sample_count[1] * 2   # damp the oversampling: halves class 1's weight
weight = 1. / class_sample_count                    # per-class sampling weight
samples_weight = weight[target]                     # per-example weight
samples_weight = torch.from_numpy(samples_weight)
train_sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler,
                               num_workers=args.num_workers, drop_last=True)
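As a sanity check on the sampler, I count how often each class is actually drawn (with the doubled class-1 count above I would expect roughly 2:1 negatives to positives per epoch, not 1:1):

from collections import Counter

sampled = Counter(int(target[i]) for i in train_sampler)  # one full pass over the sampler's indices
print(sampled)  # expect class 0 about twice as often as class 1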