I added a WeightedRandomSampler because of my unbalanced dataset, but it only decreases my scores…
I use a pretrained BERT model for token classification.
Metrics before and after adding the sampler:
Before:
              precision    recall  f1-score   support
ANAT 0.4557 0.4186 0.4364 86
CHEM 0.7145 0.5354 0.6122 762
DEVI 0.3000 0.0789 0.1250 76
DISO 0.2898 0.4476 0.3518 248
GEOG 0.0000 0.0000 0.0000 14
LIVB 0.7382 0.7319 0.7350 235
OBJC 0.1064 0.1020 0.1042 49
PHEN 0.0000 0.0000 0.0000 29
PHYS 0.3889 0.2917 0.3333 72
PROC 0.5184 0.5570 0.5370 228
micro avg 0.5419 0.4925 0.5160 1799
macro avg 0.3512 0.3163 0.3235 1799
weighted avg 0.5577 0.4925 0.5142 1799
After:
              precision    recall  f1-score   support
ANAT 0.2105 0.5581 0.3057 86
CHEM 0.4830 0.4094 0.4432 762
DEVI 0.2353 0.0526 0.0860 76
DISO 0.1559 0.1169 0.1336 248
GEOG 0.0000 0.0000 0.0000 14
LIVB 0.5144 0.6085 0.5575 235
OBJC 0.0000 0.0000 0.0000 49
PHEN 0.0179 0.0345 0.0235 29
PHYS 0.0000 0.0000 0.0000 72
PROC 0.1374 0.5965 0.2233 228
micro avg 0.2798 0.3741 0.3202 1799
macro avg 0.1754 0.2377 0.1773 1799
weighted avg 0.3310 0.3741 0.3259 1799
I compute the class weights like this:
def get_class_weights():
    df = pd.read_csv(train_file_path, sep='\t', header=None)
    df.columns = ['Text', 'Label']
    new_df = df['Label'].value_counts().to_frame()
    new_df['label'] = new_df.index
    new_df.columns = ['count', 'label']
    # Rarer labels get a weight closer to 1.
    new_df['percentage'] = 1 - new_df['count'] / new_df['count'].sum()
    class_weights = new_df['percentage'].to_list()
    # Re-resolve the device here; the module-level variable is sometimes not visible.
    dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    class_weights = torch.FloatTensor(class_weights).to(dev)
    return class_weights
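For reference, this is a self-contained toy run of the same inverse-frequency computation; the label distribution below is made up for illustration:

```python
import pandas as pd
import torch

# Hypothetical unbalanced label column (6 CHEM, 3 DISO, 1 GEOG).
df = pd.DataFrame({'Label': ['CHEM'] * 6 + ['DISO'] * 3 + ['GEOG'] * 1})

counts = df['Label'].value_counts()        # CHEM: 6, DISO: 3, GEOG: 1
weights = 1 - counts / counts.sum()        # rarer labels get larger weights
class_weights = torch.tensor(weights.to_list(), dtype=torch.float)

print(class_weights)  # tensor([0.4000, 0.7000, 0.9000])
```

Note the weights come out in `value_counts()` order (most frequent first), so they must be matched back to label indices in the same order.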
If I pass all-ones weights, nothing changes…
def get_class_fixed_weights():  # manually create all-ones weights for the 21 labels
    # Re-resolve the device here; the module-level variable is sometimes not visible.
    dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    class_weights = torch.ones(21, device=dev)
    return class_weights
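As a sanity check on the all-ones case: per-class weights of this shape are what a weighted loss consumes, and with all weights equal to one the weighted loss reduces exactly to the unweighted one, so no change is expected. A minimal sketch (shapes and names are illustrative, not the code above):

```python
import torch
import torch.nn as nn

# Hypothetical per-class weights for 21 labels; all ones = unweighted baseline.
class_weights = torch.ones(21)

# CrossEntropyLoss takes per-class weights directly via `weight`.
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)
plain_loss = nn.CrossEntropyLoss()

logits = torch.randn(8, 21)              # 8 tokens, 21 classes
targets = torch.randint(0, 21, (8,))

# With all-ones weights the two losses are identical.
assert torch.allclose(weighted_loss(logits, targets), plain_loss(logits, targets))
```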
This is how I use the sampler:
training_set = CustomDataset(tokenizer, train_sentences, train_labels, MAX_LEN)
testing_set = CustomDataset(tokenizer, test_sentences, test_labels, MAX_LEN)

sampler = WeightedRandomSampler(weights=class_weights,
                                replacement=True,
                                num_samples=len(training_set))

# 'shuffle' must be False when a sampler is supplied.
train_params = {'batch_size': TRAIN_BATCH_SIZE, 'shuffle': False, 'num_workers': 0, 'sampler': sampler}
test_params = {'batch_size': VALID_BATCH_SIZE, 'shuffle': False, 'num_workers': 0}

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
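For reference, `WeightedRandomSampler` expects one weight *per sample* (its `weights` argument defines a distribution over dataset indices), not one weight per class. A minimal sketch of expanding per-class weights to per-sample weights, with made-up labels for illustration:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical: one integer class label per training sample (6 samples, 3 classes).
sample_labels = torch.tensor([0, 0, 0, 1, 1, 2])

# Inverse-frequency per-class weights: rarer class -> higher weight.
class_counts = torch.bincount(sample_labels).float()   # tensor([3., 2., 1.])
per_class = 1.0 / class_counts

# Index the per-class weights by each sample's label to get per-sample weights.
per_sample = per_class[sample_labels]                  # length == len(dataset)

sampler = WeightedRandomSampler(weights=per_sample,
                                num_samples=len(sample_labels),
                                replacement=True)
```

For token classification each sample is a sentence carrying many token labels, so a single "class per sample" is an assumption here; one common heuristic is to weight each sentence by its rarest contained label.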