Some problems with WeightedRandomSampler

Dear groupers,
I work with an unbalanced dataset. There are six classes in my dataset: the first class has 568330 samples, the second class has 43000, the third 34900, the fourth 20910, the fifth 14590, and the last 9712. I used WeightedRandomSampler in my DataLoader, since I want each class to be sampled evenly. But I found that when I use WeightedRandomSampler, the samples in each batch always come from the first, second, and third classes; no samples from the fourth or fifth class ever appear in a batch. Here is my code snippet.
class_sample_counts = [568330.0, 43000.0, 34900.0, 20910.0, 14590.0, 9712.0]
class_weights = 1. / torch.tensor(class_sample_counts)
sampler = torch.utils.data.sampler.WeightedRandomSampler(class_weights, num_samples=len(my_dataset), replacement=True)
loader = torch.utils.data.DataLoader(
    dataset=my_dataset,
    batch_size=batch_size,
    sampler=sampler,
    pin_memory=False,
    num_workers=number_workers,
)
Can anyone help me find my problem? Thanks a lot!


I think you might be passing the wrong weights to WeightedRandomSampler.
The sequence of weights should correspond to your samples in the dataset.
Here is a small example:

weights = 1. / torch.tensor(class_sample_counts, dtype=torch.float)
samples_weights = weights[train_targets]

sampler = WeightedRandomSampler(
    weights=samples_weights,
    num_samples=len(samples_weights),
    replacement=True)
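
In case it helps, here is a rough sketch of where train_targets could come from, assuming your dataset exposes its labels through a targets attribute (as e.g. torchvision's ImageFolder does):

import torch
from torch.utils.data import WeightedRandomSampler

train_targets = torch.tensor(my_dataset.targets)     # class index per sample
class_sample_counts = torch.bincount(train_targets)  # e.g. [568330, 43000, ...]
weights = 1. / class_sample_counts.float()
samples_weights = weights[train_targets]             # one weight per sample

sampler = WeightedRandomSampler(
    weights=samples_weights,
    num_samples=len(samples_weights),
    replacement=True)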

Thank you for your prompt reply and your kind help! I'm still confused about your small example. Could you please explain in more detail what train_targets is in samples_weights = weights[train_targets]?


This line of code creates a tensor containing the corresponding weight value for each sample.
While weights has the shape [num_classes], samples_weights has the shape [num_samples] and contains [weights[y0], weights[y1], ...].
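
A tiny made-up example of that indexing:

weights = torch.tensor([0.1, 0.9])      # per-class weights, shape [num_classes]
targets = torch.tensor([0, 1, 1, 0])    # class label of each sample
samples_weights = weights[targets]      # tensor([0.1000, 0.9000, 0.9000, 0.1000])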


Hello Peter,
(I hope that’s your intended name)

I'm doing something similar, but my target isn't just a single class; it's a tensor of classes.

data1 = (['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['B-ORG\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n'])
The first list is the sentence while the second is the list of classes for each word.

This is further passed through a windower to give an output like the following (with the corresponding tags):

['<PAD>', '<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to']
['<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott']
['<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British']
['<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb']
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

Now the sampling is scrambling my sentence structure in a very bad way, and hence the word contexts for the NER task are lost.

Check an example below:

[('<PAD>', '<PAD>', '<PAD>', 'German', 'EU', '<PAD>', 'German', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 'rejects', '<PAD>', '<PAD>', 'to', 'German', 'rejects', '<PAD>', '<PAD>'),
('<PAD>', 'EU', 'EU', 'call', 'rejects', '<PAD>', 'call', '<PAD>', '<PAD>', 'EU', '<PAD>', 'EU', 'German', '<PAD>', '<PAD>', 'boycott', 'call', 'German', '<PAD>', '<PAD>'),
('EU', 'rejects', 'rejects', 'to', 'German', '<PAD>', 'to', '<PAD>', '<PAD>', 'rejects', 'EU', 'rejects', 'call', 'EU', 'EU', 'British', 'to', 'call', '<PAD>', '<PAD>'),
('rejects', 'German', 'German', 'boycott', 'call', '<PAD>', 'boycott', '<PAD>', '<PAD>', 'German', 'rejects', 'German', 'to', 'rejects', 'rejects', 'lamb', 'boycott', 'to', '<PAD>', '<PAD>'),
('German', 'call', 'call', 'British', 'to', 'Peter', 'British', 'Peter', 'Peter', 'call', 'German', 'call', 'boycott', 'German', 'German', '.', 'British', 'boycott', 'Peter', 'Peter'),
('call', 'to', 'to', 'lamb', 'boycott', 'Blackburn', 'lamb', 'Blackburn', 'Blackburn', 'to', 'call', 'to', 'British', 'call', 'call', '<PAD>', 'lamb', 'British', 'Blackburn', 'Blackburn'),
('to', 'boycott', 'boycott', '.', 'British', '<PAD>', '.', '<PAD>', '<PAD>', 'boycott', 'to', 'boycott', 'lamb', 'to', 'to', '<PAD>', '.', 'lamb', '<PAD>', '<PAD>'),
('boycott', 'British', 'British', '<PAD>', 'lamb', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 'British', 'boycott', 'British', '.', 'boycott', 'boycott', '<PAD>', '<PAD>', '.', '<PAD>', '<PAD>'),
('British', 'lamb', 'lamb', '<PAD>', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 'lamb', 'British', 'lamb', '<PAD>', 'British', 'British', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>')]

[('PAD', 'PAD', 'PAD', 'B-MISC\n', 'B-ORG\n', 'PAD', 'B-MISC\n', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'O\n', 'PAD', 'PAD', 'O\n', 'B-MISC\n', 'O\n', 'PAD', 'PAD'),
('PAD', 'B-ORG\n', 'B-ORG\n', 'O\n', 'O\n', 'PAD', 'O\n', 'PAD', 'PAD', 'B-ORG\n', 'PAD', 'B-ORG\n', 'B-MISC\n', 'PAD', 'PAD', 'O\n', 'O\n', 'B-MISC\n', 'PAD', 'PAD'),
('B-ORG\n', 'O\n', 'O\n', 'O\n', 'B-MISC\n', 'PAD', 'O\n', 'PAD', 'PAD', 'O\n', 'B-ORG\n', 'O\n', 'O\n', 'B-ORG\n', 'B-ORG\n', 'B-MISC\n', 'O\n', 'O\n', 'PAD', 'PAD'),
('O\n', 'B-MISC\n', 'B-MISC\n', 'O\n', 'O\n', 'PAD', 'O\n', 'PAD', 'PAD', 'B-MISC\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'O\n', 'O\n', 'O\n', 'O\n', 'PAD', 'PAD'),
('B-MISC\n', 'O\n', 'O\n', 'B-MISC\n', 'O\n', 'B-PER\n', 'B-MISC\n', 'B-PER\n', 'B-PER\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'B-MISC\n', 'B-MISC\n', 'O\n', 'B-MISC\n', 'O\n', 'B-PER\n', 'B-PER\n'),
('O\n', 'O\n', 'O\n', 'O\n', 'O\n', 'I-PER\n', 'O\n', 'I-PER\n', 'I-PER\n', 'O\n', 'O\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'PAD', 'O\n', 'B-MISC\n', 'I-PER\n', 'I-PER\n'),
('O\n', 'O\n', 'O\n', 'O\n', 'B-MISC\n', 'PAD', 'O\n', 'PAD', 'PAD', 'O\n', 'O\n', 'O\n', 'O\n', 'O\n', 'O\n', 'PAD', 'O\n', 'O\n', 'PAD', 'PAD'),
('O\n', 'B-MISC\n', 'B-MISC\n', 'PAD', 'O\n', 'PAD', 'PAD', 'PAD', 'PAD', 'B-MISC\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'O\n', 'PAD', 'PAD', 'O\n', 'PAD', 'PAD'),
('B-MISC\n', 'O\n', 'O\n', 'PAD', 'O\n', 'PAD', 'PAD', 'PAD', 'PAD', 'O\n', 'B-MISC\n', 'O\n', 'PAD', 'B-MISC\n', 'B-MISC\n', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD')]

Regards

Could you provide the Dataset code and how you are calculating the weights?
I'm not an NLP expert, but I guess your sentences shouldn't be split up by the sampling, so it would be interesting to see how you are dealing with your data in general.
Also, you might want to have a look at torchtext.

Hello Peter,

This is a conll2003 corpus and a sentence looks like one below:

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O

All I have to do is extract each word and its NER tag from the rightmost column and make my sentence look like the one below:

(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] → one image
['B-ORG\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n']) → set of tags/labels for the image

Each unique word and tag has an index like so:

{'B-LOC\n': 4,
 'B-MISC\n': 9,
 'B-ORG\n': 6,
 'B-PER\n': 7,
 'I-LOC\n': 8,
 'I-MISC\n': 3,
 'I-ORG\n': 1,
 'I-PER\n': 5,
 'O\n': 2,
 'PAD': 0}

Here each word in a sentence is converted to its indices like so:

[ 0, 10397, 21273, 16075, 20124, 15819, 2729, 13298, 14387]

and is then embedded through an nn.Embedding layer (so that each word becomes a 50-dimensional vector). A single sentence thus becomes a 2D image. But instead of having a single label per image, I have a list of labels per image: the NER tags of each word. The problem is that my dataset has a lot of words of the 'O\n' class, as pointed out in the comment earlier, so my model tends to predict the dominant class (the typical class imbalance problem).
So I need to balance these classes.
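
For illustration, the embedding step could look like this minimal sketch (vocab_size is an assumed placeholder):

import torch
import torch.nn as nn

vocab_size = 25000                  # assumed vocabulary size
embedding = nn.Embedding(vocab_size, embedding_dim=50)
window = torch.tensor([0, 10397, 21273, 16075, 20124, 15819, 2729, 13298, 14387])
embedded = embedding(window)        # shape [9, 50] -> one 2D "image" per window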

The code to calculate weights:

indexed_counts #frequency of each class

{0: 112328,
1: 3704,
2: 169578,
3: 1155,
4: 7140,
5: 4528,
6: 6321,
7: 6600,
8: 1157,
9: 3438}

tag_weights = {}
for key in indexed_counts.keys():
    tag_weights[key] = 1. / indexed_counts[key]
# list of per-tag weights, ordered by tag index
sampler = [i[1] for i in sorted(tag_weights.items())]
sampler

[8.902499821950003e-06,
0.0002699784017278618,
5.896991355010673e-06,
0.0008658008658008658,
0.00014005602240896358,
0.00022084805653710247,
0.00015820281601012498,
0.00015151515151515152,
0.000864304235090752,
0.00029086678301337986]

(As a side note, I'm passing this as the weights tensor to my cross-entropy loss function.)

Just so you can compare that tag_weights and sampler hold the weights in the same class order:
tag_weights

{0: 8.902499821950003e-06, 1: 0.0002699784017278618, 2: 5.896991355010673e-06, 3: 0.0008658008658008658, 4: 0.00014005602240896358, 5: 0.00022084805653710247, 6: 0.00015820281601012498, 7: 0.00015151515151515152, 8: 0.000864304235090752, 9: 0.00029086678301337986}
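
For completeness, here is only a sketch (not necessarily the right weighting scheme) of how these per-tag weights could be mapped to one weight per window, so that WeightedRandomSampler draws whole windows instead of single words; windowed_tags is an assumed LongTensor of shape [num_windows, window_len] holding the tag indices of each window:

import torch

tag_weight_tensor = torch.tensor(
    [i[1] for i in sorted(tag_weights.items())], dtype=torch.float)  # [num_tags]
# average the tag weights inside each window, so windows containing
# rare tags get a higher chance of being drawn
window_weights = tag_weight_tensor[windowed_tags].mean(dim=1)        # [num_windows]

window_sampler = torch.utils.data.WeightedRandomSampler(
    weights=window_weights,
    num_samples=len(window_weights),
    replacement=True)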

Coming to your suggestion of splitting by classes: splitting a sentence without maintaining the sequence of words before and after would mean the context of a word is lost. The model needs to learn the NER tag of a given word from its contextual meaning as well (the words appearing before and after it).

On a different note, is there a way to penalize my model based on the F1 score? Instead of considering sampling as a solution, can we somehow train the model towards a better F1 score? The point being: how can I use the F1 score as a metric for my model instead of the loss alone?

You can keep track of the F1 score on the validation data and decide when to stop training based on the best F1. I am using the ignite library to do a similar thing. You can create a metric, e.g. class F1Weighted(Metric):, by subclassing the Metric class from pytorch-ignite. Then add the metric as follows:

evaluator = create_supervised_evaluator(model,
                                        metrics={'f_1': F1Weighted()})

You can use early stopping from ignite to stop training, and also model checkpointing. It's all present in ignite.
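
A minimal sketch of such a metric, assuming sklearn is available for the actual F1 computation (the F1Weighted internals here are illustrative, not ignite's built-in API):

import torch
from ignite.metrics import Metric
from sklearn.metrics import f1_score

class F1Weighted(Metric):
    def reset(self):
        self._preds = []
        self._targets = []

    def update(self, output):
        # output is the (y_pred, y) pair provided by the evaluator
        y_pred, y = output
        self._preds.append(y_pred.argmax(dim=1).cpu())
        self._targets.append(y.cpu())

    def compute(self):
        preds = torch.cat(self._preds).numpy()
        targets = torch.cat(self._targets).numpy()
        return f1_score(targets, preds, average='weighted')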


Thanks for the new lead on a new day! :)
I'll get back to you for any help, if needed, soon.

I have unbalanced data and want to use this solution. However, I get this error:

UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.weights = torch.tensor(weights, dtype=torch.double)

Can you please help?

classcount = np.bincount(trainset.labels).tolist()
train_weights = 1. / torch.tensor(classcount, dtype=torch.float)
train_sampleweights = train_weights[trainset.labels]
train_sampler = Sampler.WeightedRandomSampler(weights=train_sampleweights, num_samples=len(train_sampleweights))
trainloader = DataLoader(trainset, sampler=train_sampler, shuffle=False)

It looks like train_sampleweights is a numpy array.
Could you convert it to a torch.tensor using train_sampleweights = torch.from_numpy(train_sampleweights) before passing it to the WeightedRandomSampler?

Thank you for replying. Unfortunately, it gave this error:

train_sampleweights = torch.from_numpy(train_sampleweights)
TypeError: expected np.ndarray (got Tensor)

So when I print out some information on train_sampleweights, I think it may help, but I'm not sure what the class constructor expects.

train_weights = tensor([4.3002e-06, 4.3002e-06, 4.3002e-06,  ..., 4.3002e-06, 4.3002e-06, 4.3002e-06])
train_weights shape=  torch.Size([246048])

Thanks for the info. I was mistaken and your code should work fine.
You can ignore this warning for now, as it’s being thrown internally in this line of code.
I don’t think it will cause any trouble.

CC @SimonW I think this change was made in this PR.
Do you think it should be changed?

Oh sure. The warning was introduced after that PR was merged, but we should change this. Could you open an issue?


Sure!
I’ll open an issue and suggest a fix.


Hello everyone,

I'm replying here since my problem is similar to the first question asked, and I want to avoid opening another thread.

I'm struggling with a 3-class problem with a large class imbalance. In the following code I create the weight arrays (first the array with shape [num_classes], then the one with shape [data_length]):

weight = 1. / count_train                                                  # shape [num_classes]
self.train_samples_weight = [weight[cla] for cla in Labels_train]         # one weight per sample
self.test_samples_weight = [weight[class_id] for class_id in Labels_test]
self.train_samples_weight = np.asarray(self.train_samples_weight)
self.test_samples_weight = np.asarray(self.test_samples_weight)

and it looks correct to me. Then the code where I create the Sampler and the Loader:

train_sampler = torch.utils.data.WeightedRandomSampler(self.train_samples_weight, 1, replacement=True)
test_sampler = torch.utils.data.WeightedRandomSampler(self.test_samples_weight, 1, replacement=True)
test_loader = DataLoader(dataset=self.dataset_test, batch_size=self.batch_size, sampler=test_sampler, drop_last=True)
train_loader = DataLoader(dataset=self.dataset_train, batch_size=self.batch_size, sampler=train_sampler, drop_last=True)

The first thing is that I'm using 1 as num_samples, since I just want to extract a number of samples corresponding to my batch size. Anyway, this is misleading, since the other samplers don't work like this, as pointed out here, but I guess the problem is in my PyTorch version.

The real issue is the class distribution when I enumerate my DataLoader. I only ever get samples of class_2 (the class with the largest number of occurrences), even though its weight is the lowest, which looks correct to me. So what am I doing wrong? I was looking for a way to get balanced batches, and I still think this is the correct way to go.

num_samples gives the number of samples to draw, so usually you would set it to the length of your Dataset. Currently you should only get a single batch containing one sample.
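
A quick way to see this (dummy weights, just illustrating how num_samples controls the sampler length):

import torch
from torch.utils.data import WeightedRandomSampler

weights = torch.ones(100)  # dummy per-sample weights
print(len(list(WeightedRandomSampler(weights, num_samples=1))))             # 1
print(len(list(WeightedRandomSampler(weights, num_samples=len(weights)))))  # 100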

Yes, I just checked this and that's true. Anyway, I can't understand why num_samples should be equal to my dataset size. I have already seen examples with this setting, and it makes no sense to me.

Without specifying the sampler, my enumerate(loader) yields (batch_id, data), where data has shape [batch_size, ...my data dimensions...]. However, when I use the customized sampler, I end up with a data shape of [batch_size, num_samples, ...my data dimensions...].

Can you explain why it should work like that? I can't see the point of this. I was expecting to still have the same dimensions, just with samples chosen according to the new criterion.

That shouldn’t be the case and I’ve never seen this behavior before.
Could you post a code snippet, so that we can reproduce this issue?
Here is a dummy example which results in [batch_size, nb_features] for each batch:

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Create dummy data with class imbalance 99 to 1
numDataPoints = 1000
data_dim = 5
bs = 100
data = torch.randn(numDataPoints, data_dim)
target = torch.cat((torch.zeros(int(numDataPoints * 0.99), dtype=torch.long),
                    torch.ones(int(numDataPoints * 0.01), dtype=torch.long)))

print('target train 0/1: {}/{}'.format(
    (target == 0).sum(), (target == 1).sum()))

# Compute samples weight (each sample should get its own weight)
class_sample_count = torch.tensor(
    [(target == t).sum() for t in torch.unique(target, sorted=True)])
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[t] for t in target])

# Create sampler, dataset, loader
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_dataset = torch.utils.data.TensorDataset(data, target)
train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

# Iterate DataLoader and check class balance for each batch
for i, (x, y) in enumerate(train_loader):
    print("batch index {}, 0/1: {}/{}".format(
        i, (y == 0).sum(), (y == 1).sum()))
    print("x.shape {}, y.shape {}".format(x.shape, y.shape))