Some problems with WeightedRandomSampler

Ah OK, I see. The docs currently say weights should be a sequence, but maybe we should add more information on the expected shape.
What happened is that the additional dimension makes torch.multinomial treat each row of the weights as a separate distribution:

weights = torch.empty(10).uniform_()
print(torch.multinomial(weights, 10, True))
> tensor([6, 6, 6, 0, 4, 2, 4, 5, 6, 6])
weights = torch.empty(10, 1).uniform_()
print(torch.multinomial(weights, 10, True))
> tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

So basically you received just the 0th sample a lot of times, since each weight row has only one value.
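
A minimal sketch of the shape difference, assuming the weights ended up as a column of shape [N, 1] (for example after a reshape). Flattening them gives back a single distribution over N samples; the lower bound of 0.1 in uniform_ is only there so that no row sums to zero:

```python
import torch

torch.manual_seed(0)

# Ten weights stored as a column: torch.multinomial sees 10 distributions
# with a single category each, so every draw can only return index 0.
weights_2d = torch.empty(10, 1).uniform_(0.1, 1.0)
per_row = torch.multinomial(weights_2d, 10, replacement=True)
print(per_row.shape, per_row.unique())

# Flattening gives one distribution over 10 categories.
weights_1d = weights_2d.flatten()
flat_draws = torch.multinomial(weights_1d, 10, replacement=True)
print(flat_draws.shape, flat_draws)
```

So if the per-sample weights come out of a reshape or a column of a DataFrame, calling .flatten() (or .squeeze()) before building the sampler should avoid this.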


Yeah, I see that now; I noticed it while trying to print the samples.

Anyway, I managed to get my code to work, but I still have doubts about the sampler. My weights for each class are these:


[0.00961538 0.00155763 0.00127551]

and that’s correct, since my class_0 has only a few occurrences. When I don’t specify any sampler, I get a class distribution in every batch that looks like this (with batch_size 170):


(array([0, 1, 2]), array([ 6, 75, 89], dtype=int64))

(array([0, 1, 2]), array([11, 65, 94], dtype=int64))

(array([0, 1, 2]), array([13, 80, 77], dtype=int64))

(array([0, 1, 2]), array([15, 73, 82], dtype=int64))

(array([0, 1, 2]), array([10, 66, 94], dtype=int64))

and this looks good; in fact it represents the class distribution in my dataset, as I was expecting.

But with the weighted sampler I get:


(array([0, 1, 2]), array([ 1, 66, 103], dtype=int64))

(array([0, 1, 2]), array([ 1, 75, 94], dtype=int64))

(array([0, 1, 2]), array([ 4, 72, 94], dtype=int64))

(array([0, 1, 2]), array([ 1, 75, 94], dtype=int64))

(array([0, 1, 2]), array([ 4, 61, 105], dtype=int64))

(array([0, 1, 2]), array([ 3, 61, 106], dtype=int64))

What I was expecting was to get more samples of my least frequent class, but I get fewer samples of it instead. Furthermore, sometimes there are no samples of class_0 in a batch at all, and this totally messes up my metrics evaluation. Do you think this is working properly, or are there still some bugs?

Based on your weights, I assume you might have multiples of this distribution:

class_counts = torch.tensor([104, 642, 784])

If so, I’ve manipulated my example code to use your weights and data distribution to get approx. equally distributed batches:

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Create dummy data with the class imbalance 104/642/784
class_counts = torch.tensor([104, 642, 784])
numDataPoints = class_counts.sum()
data_dim = 5
bs = 170
data = torch.randn(numDataPoints, data_dim)

target = torch.cat((torch.zeros(class_counts[0], dtype=torch.long),
                    torch.ones(class_counts[1], dtype=torch.long),
                    torch.ones(class_counts[2], dtype=torch.long) * 2))

print('target train 0/1/2: {}/{}/{}'.format(
    (target == 0).sum(), (target == 1).sum(), (target == 2).sum()))

# Compute samples weight (each sample should get its own weight)
class_sample_count = torch.tensor(
    [(target == t).sum() for t in torch.unique(target, sorted=True)])
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[t] for t in target])

# Create sampler, dataset, loader
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_dataset = torch.utils.data.TensorDataset(data, target)
#train_dataset = triaxial_dataset(data, target)
train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=0, sampler=sampler)

# Iterate DataLoader and check class balance for each batch
for i, (x, y) in enumerate(train_loader):
    print("batch index {}, 0/1/2: {}/{}/{}".format(
        i, (y == 0).sum(), (y == 1).sum(), (y == 2).sum()))

> target train 0/1/2: 104/642/784
batch index 0, 0/1/2: 52/60/58
batch index 1, 0/1/2: 63/60/47
batch index 2, 0/1/2: 62/58/50
batch index 3, 0/1/2: 59/60/51
batch index 4, 0/1/2: 45/65/60
batch index 5, 0/1/2: 59/60/51
batch index 6, 0/1/2: 54/56/60
batch index 7, 0/1/2: 59/60/51
batch index 8, 0/1/2: 57/64/49

Could you compare your code to mine and let me know, if you get stuck somewhere?


Hi Peter, I would like to sample a mask once. I have already generated the weights for one image; the weight has size 128 x 128, and sample_weight is 128 x 128 too. But building the sampler with sampler = WeightedRandomSampler(sample_weight, len(samples_weight)) doesn’t work, because my sample_weight is a 2D tensor.
Do you know how to build a sampler in this case?
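
One possible approach, sketched under the assumption that the goal is to pick pixel positions of a single image in proportion to a 2D weight map: flatten the weights into one distribution, draw flat indices with torch.multinomial, and map them back to (row, col). The weight values and num_draws below are made up for illustration:

```python
import torch

torch.manual_seed(0)

# Hypothetical per-pixel weights for one 128 x 128 image.
height, width = 128, 128
pixel_weight = torch.rand(height, width) + 1e-3

# Flatten to a single distribution over all pixels and draw without
# replacement (num_draws is an arbitrary choice for this sketch).
num_draws = 1000
flat_idx = torch.multinomial(pixel_weight.flatten(), num_draws, replacement=False)

# Convert flat indices back to (row, col) and build a binary mask.
rows = torch.div(flat_idx, width, rounding_mode="floor")
cols = flat_idx % width
mask = torch.zeros(height, width, dtype=torch.bool)
mask[rows, cols] = True
print(mask.shape, mask.sum())
```

WeightedRandomSampler yields dataset indices for a DataLoader, so for a one-off draw inside a single image calling torch.multinomial directly may be the simpler tool.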

Hi Ptrblck,
I have decided to deal with the class imbalance using a weighted NLLLoss. So the weight passed to this function is the same as the one you mentioned before, i.e. samples_weight, right?

The weight argument for nn.NLLLoss has to be a tensor containing the class weights, not the sample weights, i.e. weight should have the shape [number_of_classes].
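
A short sketch of what that looks like, reusing the class counts from the earlier example; the logits and targets are random dummy values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Class counts borrowed from the earlier example in this thread.
class_counts = torch.tensor([104, 642, 784])
class_weight = 1.0 / class_counts.float()  # shape: [number_of_classes]

criterion = nn.NLLLoss(weight=class_weight)

# Dummy batch: 8 samples, 3 classes. NLLLoss expects log-probabilities.
logits = torch.randn(8, 3)
target = torch.randint(0, 3, (8,))
loss = criterion(F.log_softmax(logits, dim=1), target)
print(class_weight.shape, loss)
```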

Let’s continue the discussion about the WeightedRandomSampler in this thread.

hello @ptrblck
I have tried your suggestions from other posts as well as this one, and my code is attached below. The first issue is that len(sampler) is not equal to len(train_loader). Secondly, I am receiving the error below when running my code with this sampler; without the sampler it was working fine.

RuntimeError: [enforce fail at …\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 1895474592648 bytes. Buy new RAM!

tar = pd.read_csv(r'E:\Biglabelsjustclassnum.csv')
tar_lab = LABELS_WEIGHTS[tar]
samples_weight = torch.from_numpy(tar_lab)
sampler = torch.utils.data.WeightedRandomSampler(samples_weight, len(samples_weight))
data_split = 0.70
L_data = len(data)
lengths = [int(L_data * data_split), 25000, L_data - int(L_data * data_split) - 25000]
train_set, val_set, test_set = torch.utils.data.random_split(data, lengths)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE, num_workers=4, pin_memory=True, sampler=sampler)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=BATCH_SIZE, num_workers=3, pin_memory=True, sampler=sampler)
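
On the first point: len(sampler) is the number of samples drawn per epoch, while len(train_loader) is the number of batches, so the two are expected to differ. Separately, the sampler in the snippet appears to be built from the labels of the full dataset but passed to loaders over subsets. Below is a sketch of building the weights for the training subset only, assuming random_split returns Subset objects exposing .indices; the data, labels and class weights are dummy stand-ins:

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler, random_split

torch.manual_seed(0)

# Dummy stand-ins for the real data and labels (assumption for this sketch).
labels = torch.randint(0, 3, (1000,))
data = TensorDataset(torch.randn(1000, 5), labels)
class_weights = 1.0 / torch.bincount(labels, minlength=3).float()

train_set, val_set = random_split(data, [700, 300])

# One weight per *training* sample, looked up through the subset indices.
train_labels = labels[torch.as_tensor(train_set.indices)]
samples_weight = class_weights[train_labels]
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

batch_size = 64
train_loader = DataLoader(train_set, batch_size=batch_size, sampler=sampler)
val_loader = DataLoader(val_set, batch_size=batch_size)  # no resampling for validation

# Samples per epoch vs. number of batches: ceil(700 / 64) batches.
print(len(sampler), len(train_loader), math.ceil(len(sampler) / batch_size))
```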

I have been reading the various discussions about WeightedRandomSampler, but I still do not understand what weights[train_targets] is. Could you explain this further for a complete beginner like me? Is it a list of all the labels in the training dataset? I have a directory for each of my classes that contains those images; is that a normal or a special case?


The weights tensor will contain the reciprocal of the class counts.
So let’s say your class distribution is:

class0 = 100
class1 = 10
class2 = 1000

class_counts = [class0, class1, class2]
weight = 1. / torch.tensor(class_counts).float()
print(weight)
> tensor([0.0100, 0.1000, 0.0010])

As you can see, class1 with the lowest number of samples has the highest weight now.

However, for the WeightedRandomSampler we need to provide a weight for each sample.
So if your target is defined as:

target = torch.cat((
    torch.zeros(class0), torch.ones(class1), torch.ones(class2)*2.)).long()
# shuffle
target = target[torch.randperm(len(target))]

we can directly index weight to get the corresponding weight for each target sample:

# Get corresponding weight for each target
sample_weight = weight[target]
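
As a quick sanity check of why this balances the classes (a sketch using the counts above): every sample of a class carries the same weight, and each class's total weight is count * (1 / count) = 1, so each class should be drawn with roughly equal probability:

```python
import torch

class_counts = torch.tensor([100, 10, 1000])
weight = 1.0 / class_counts.float()

# One target per sample, in class order (shuffling would not change the result).
target = torch.repeat_interleave(torch.arange(3), class_counts)
sample_weight = weight[target]

# Probability of drawing each class = sum of its sample weights / total weight.
class_mass = torch.zeros(3).index_add_(0, target, sample_weight)
class_prob = class_mass / sample_weight.sum()
print(class_prob)
```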

What do you mean a weight for each sample? All samples of a class will have the same weight, right?

My images are in class specific subfolders, is it possible to index all the samples across the subfolders?

Using target([0, 0, 2, 2, 1, 1]) we will create sample_weight([0.01, 0.01, 0.001, 0.001, 0.1, 0.1]).

To create the weights, you would need the targets first, so you might need to iterate your Dataset once and store the targets.


OK, I think I understand the sample_weight now.

Do you have an example of iterating the dataset and storing the targets?

If your dataset returns a data and target sample, this should work:

targets = []
for _, target in dataset:
    targets.append(target)
targets = torch.stack(targets)

Thanks for the help! It seems to work now with the modification below.

targets = []
for _, target in datasets:
    targets.append(target)
#targets = torch.stack(targets) #concatenates tensors
targets = torch.tensor(targets)
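
The difference is presumably down to the type of the targets: torch.stack expects a sequence of tensors, while torch.tensor accepts plain Python numbers. A small illustration, assuming the dataset returns integer class indices:

```python
import torch

# Targets returned as Python ints (e.g. class indices from a folder-based dataset).
int_targets = [0, 2, 1, 2]
as_tensor = torch.tensor(int_targets)

# Targets returned as 0-dim tensors.
tensor_targets = [torch.tensor(t) for t in int_targets]
stacked = torch.stack(tensor_targets)

print(as_tensor, stacked)
```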

It is taking too much time to append the targets; my dataset has around 40,000 files.

You would only have to grab the target values once and could store them in a tensor.
Don’t append the files, but the target values only.
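
If the labels are available without opening the files, iterating the dataset (and paying for image loading and augmentation) can be skipped entirely. A sketch with a hypothetical LabelledFiles dataset that keeps its labels in a list; the class name, file names and labels are invented for illustration:

```python
import torch
from torch.utils.data import Dataset


class LabelledFiles(Dataset):
    """Hypothetical dataset that knows its labels without opening any file."""

    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # Stand-in for expensive image loading and augmentation.
        image = torch.zeros(3, 8, 8)
        return image, self.labels[index]


dataset = LabelledFiles([f"img_{i}.png" for i in range(6)], [0, 0, 2, 2, 1, 1])

# Read the labels directly instead of iterating the dataset, then index
# the per-class weights once to get per-sample weights.
targets = torch.tensor(dataset.labels)
weight = 1.0 / torch.bincount(targets).float()
sample_weight = weight[targets]
print(sample_weight)
```

The resulting tensor could also be saved once with torch.save and reloaded on later runs.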

Here is the code

ListDataset is a class

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
targets = []
for _, _, target in dataset:
    targets.append(target)
targets = torch.stack(targets)
print(targets)
print(type(targets))
sample_weights = weight[targets]
dataloader = torch.utils.data.DataLoader(
    dataset,
    sampler=WeightedRandomSampler(weights=sample_weights, num_samples=len(sample_weights), replacement=True),
    batch_size=opt.batch_size,
    shuffle=False,
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
)

here is the ListDataset class

class ListDataset(Dataset):
    def __init__(self, list_path, img_size=416, augment=True, multiscale=True, normalized_labels=True):
        with open(list_path, "r") as file:
            self.img_files = file.readlines()

        self.label_files = [
            path.replace("images", "labels").replace(".png", ".txt").replace(".jpg", ".txt")
            for path in self.img_files
        ]
        self.img_size = img_size
        self.max_objects = 100
        self.augment = augment
        self.multiscale = multiscale
        self.normalized_labels = normalized_labels
        self.min_size = self.img_size - 3 * 32
        self.max_size = self.img_size + 3 * 32
        self.batch_count = 0

    def __getitem__(self, index):

        # ---------
        #  Image
        # ---------

        img_path = self.img_files[index % len(self.img_files)].rstrip()

        # Extract image as PyTorch tensor
        img = transforms.ToTensor()(Image.open(img_path).convert('RGB'))

        # Handle images with less than three channels
        if len(img.shape) != 3:
            img = img.unsqueeze(0)
            img = img.expand((3, img.shape[1:]))

        _, h, w = img.shape
        h_factor, w_factor = (h, w) if self.normalized_labels else (1, 1)
        # Pad to square resolution
        img, pad = pad_to_square(img, 0)
        _, padded_h, padded_w = img.shape

        # ---------
        #  Label
        # ---------

        label_path = self.label_files[index % len(self.img_files)].rstrip()

        targets = None
        if os.path.exists(label_path):
            boxes = torch.from_numpy(np.loadtxt(label_path).reshape(-1, 5))
            # Extract coordinates for unpadded + unscaled image
            x1 = w_factor * (boxes[:, 1] - boxes[:, 3] / 2)
            y1 = h_factor * (boxes[:, 2] - boxes[:, 4] / 2)
            x2 = w_factor * (boxes[:, 1] + boxes[:, 3] / 2)
            y2 = h_factor * (boxes[:, 2] + boxes[:, 4] / 2)
            # Adjust for added padding
            x1 += pad[0]
            y1 += pad[2]
            x2 += pad[1]
            y2 += pad[3]
            # Returns (x, y, w, h)
            boxes[:, 1] = ((x1 + x2) / 2) / padded_w
            boxes[:, 2] = ((y1 + y2) / 2) / padded_h
            boxes[:, 3] *= w_factor / padded_w
            boxes[:, 4] *= h_factor / padded_h

            targets = torch.zeros((len(boxes), 6))
            targets[:, 1:] = boxes

        # Apply augmentations
        if self.augment:
            if np.random.random() < 0.5:
                img, targets = horisontal_flip(img, targets)

        return img_path, img, targets

    def collate_fn(self, batch):
        paths, imgs, targets = list(zip(*batch))

        # Remove empty placeholder targets
        targets = [boxes for boxes in targets if boxes is not None]
        # Add sample index to targets
        for i, boxes in enumerate(targets):
            boxes[:, 0] = i
        targets = torch.cat(targets, 0)
        # Selects new image size every tenth batch
        if self.multiscale and self.batch_count % 10 == 0:
            self.img_size = random.choice(range(self.min_size, self.max_size + 1, 32))
        # Resize images to input shape
        imgs = torch.stack([resize(img, self.img_size) for img in imgs])
        self.batch_count += 1
        # if targets[0][1] == 0:
        #     pass

        return paths, imgs, targets

    def __len__(self):
        return len(self.img_files)

Hello,

Can someone give me a reference on how to implement a weighted random sampler for a text dataset?

Hi, I am trying the code below on text data (NLP):
target_list = torch.tensor(train_data['label'])
target_list = target_list[torch.randperm(len(target_list))]
class_count = [i for i in target_list]
class_weights = 1./torch.tensor(class_count, dtype=torch.float)
class_weights_all = class_weights[target_list]
print(class_weights)
but the weights are not correct:

weight:[1., 1., 1., 1., 1., 1., inf, inf, 1., 1., 1., 1., inf, 1., 1., inf, 1., inf,
1., inf, 1., 1., 1., inf, 1., 1., 1., 1., 1., 1., 1., inf, 1., 1., 1., 1.,
1., 1., inf, 1., 1., 1., 1., 1., 1., 1., inf, 1., 1., inf, 1., inf, 1., 1.,
inf, 1., 1., 1., inf, 1., 1., 1., inf, 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., inf, inf, inf, inf, inf, 1., inf, 1.,