Hello,
I am working with DNA sequence data and training a CNN. My dataset is highly imbalanced:
positive class samples (~500)
negative class samples (~150,000)
So I am using WeightedRandomSampler to oversample the positive class and balance the classes before feeding the data to the DataLoader.
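For reference, a minimal sketch of this kind of sampler setup on toy data. The tensor shapes, class counts, batch size, and variable names here are assumptions for illustration, not the actual pipeline from this post.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy stand-in for one training fold: 1,000 negatives and 10 positives.
labels = torch.cat([torch.zeros(1000), torch.ones(10)]).long()
features = torch.randn(len(labels), 4)  # placeholder features, not real DNA encodings
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class, so both
# classes are drawn with roughly equal probability.
class_counts = torch.bincount(labels, minlength=2).float()
sample_weights = (1.0 / class_counts)[labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),
    replacement=True,  # required so the few positives can be drawn repeatedly
)

# A sampler replaces shuffle=True; the two options cannot be combined.
train_loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Check how balanced one pass over the loader actually is.
drawn = torch.cat([y for _, y in train_loader])
positive_fraction = drawn.float().mean().item()
print(f"positive fraction seen in training: {positive_fraction:.2f}")
```

In a setup like this the sampler would normally be attached to the training loader only, so that the validation fold keeps its natural class ratio and the reported metrics reflect the real imbalance.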
I use 5-fold cross-validation. In a few test runs, I got a decent ROC-AUC, but the PR-AUC is really low.
For fold 1:
roc auc 0.9667848699763594
precision auc 0.055329116326074484
For fold 2:
roc auc 0.8476321207961566
precision auc 0.03307627288669479
For fold 3:
roc auc 0.9528898540612085
precision auc 0.05020178518546394
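One way to put these PR-AUC numbers in context: unlike ROC-AUC, whose chance level is 0.5, the chance level for PR-AUC is roughly the fraction of positives in the evaluation data. The sketch below compares the reported values against that baseline, assuming each test fold keeps the overall class ratio (about 500 positives to 150,000 negatives), which the post does not state explicitly.

```python
# Approximate class counts from the post.
n_pos = 500
n_neg = 150_000

# A classifier that ranks samples at random has an expected PR-AUC
# close to the prevalence of the positive class.
prevalence = n_pos / (n_pos + n_neg)

# PR-AUC values reported for the first three folds.
reported = {
    1: 0.055329116326074484,
    2: 0.03307627288669479,
    3: 0.05020178518546394,
}

# How many times better than chance each fold is.
lift = {fold: pr_auc / prevalence for fold, pr_auc in reported.items()}

print(f"chance-level PR-AUC: {prevalence:.4f}")
for fold, value in lift.items():
    print(f"fold {fold}: PR-AUC {reported[fold]:.4f} is {value:.1f}x chance")
```

Under that assumption the scores are low in absolute terms but well above chance, so it can help to report them alongside the baseline.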
I suspect that there are a lot of false negatives. Since the number of positive samples (~500) is very low compared to the number of negative samples (~150,000), the model learns the negative class better and predicts most of the test samples as negative.
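Whether the errors are mostly false negatives or false positives can be checked directly by counting them at a few decision thresholds. A minimal pure-Python sketch follows. The scores and labels are made-up toy values, and `precision_recall_at` is a hypothetical helper name, not a library function.

```python
def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall, false_positives, false_negatives) at a threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, fp, fn


# Toy predicted probabilities and true labels (illustrative only).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    precision, recall, fp, fn = precision_recall_at(scores, labels, threshold)
    print(
        f"threshold {threshold:.2f}: precision {precision:.2f}, "
        f"recall {recall:.2f}, FP {fp}, FN {fn}"
    )
```

Running the same count on a real validation fold shows which error type dominates at a given threshold. Because the model here is trained on rebalanced batches, the default threshold of 0.5 is not necessarily the best operating point on data with the natural class ratio.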
I tried weighting the positive class using
import torch
import torch.nn as nn

# Weight the loss on positive samples 50x (device is defined elsewhere)
class_weight = torch.tensor([50.0], device=device)
criterion = nn.BCEWithLogitsLoss(pos_weight=class_weight)
By doing this, almost all samples are predicted as positive.
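A possible explanation, assuming the WeightedRandomSampler was still active when pos_weight was added (the post does not say): the two corrections stack. The sketch below estimates the ratio of total loss weight on positives to that on negatives in a batch. `effective_pos_to_neg_ratio` is a hypothetical helper written for illustration.

```python
def effective_pos_to_neg_ratio(positive_fraction, pos_weight):
    """Total loss weight on positives divided by total loss weight on negatives.

    positive_fraction: share of positive samples in a typical batch.
    pos_weight: multiplier applied to the positive term of the loss.
    """
    return pos_weight * positive_fraction / (1.0 - positive_fraction)


natural = 500 / (500 + 150_000)  # batches drawn without rebalancing
balanced = 0.5                   # batches drawn with a balancing sampler

ratio_natural_weighted = effective_pos_to_neg_ratio(natural, 50.0)
ratio_balanced_weighted = effective_pos_to_neg_ratio(balanced, 50.0)
ratio_balanced_unweighted = effective_pos_to_neg_ratio(balanced, 1.0)

print(f"natural batches, pos_weight=50:  {ratio_natural_weighted:.2f}")
print(f"balanced batches, pos_weight=50: {ratio_balanced_weighted:.2f}")
print(f"balanced batches, pos_weight=1:  {ratio_balanced_unweighted:.2f}")
```

With balanced batches, a pos_weight of 50 makes the positive class count about 50 times as much as the negative class in the loss, which would be consistent with nearly everything being predicted as positive. Using either the sampler or pos_weight, rather than both, avoids this double correction.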
I also tried adaptive learning rates, but the precision and recall values did not improve.
Can someone suggest ideas for improving the precision and recall values?
Thanks!