Some problems with WeightedRandomSampler

What do you mean a weight for each sample? All samples of a class will have the same weight, right?

My images are in class specific subfolders, is it possible to index all the samples across the subfolders?

For target = [0, 0, 2, 2, 1, 1] we would create sample_weight = [0.01, 0.01, 0.001, 0.001, 0.1, 0.1], i.e. each sample is assigned the weight of its class.
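
As a minimal sketch (the per-class weights here are just the assumed values from that example), the mapping is a plain indexing operation:

import torch

# assumed per-class weights: class 0 -> 0.01, class 1 -> 0.1, class 2 -> 0.001
class_weights = torch.tensor([0.01, 0.1, 0.001])
target = torch.tensor([0, 0, 2, 2, 1, 1])

# index the class weights with the targets to get one weight per sample
sample_weight = class_weights[target]
print(sample_weight)  # tensor([0.0100, 0.0100, 0.0010, 0.0010, 0.1000, 0.1000])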

To create the weights, you would need the targets first, so you might need to iterate your Dataset once and store the targets.


OK, I think I understand the sample_weight now.

Do you have an example of iterating the dataset and storing the targets?

If your dataset returns a data and target sample, this should work:

targets = []
for _, target in dataset:
    targets.append(target)
targets = torch.stack(targets)

Thanks for the help! It seems to work now with the modification below.

targets = []
for _, target in dataset:
    targets.append(target)
# targets = torch.stack(targets)  # torch.stack expects tensors; torch.tensor works for plain scalar targets
targets = torch.tensor(targets)

It is taking too much time to append the targets; I have around 40,000 files.

You would have to grab the target values only once and could store them in a tensor.
Don’t append the files, only the target values.
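
If iterating the Dataset is slow because every __getitem__ also loads and decodes an image, one option is to read only the labels. A rough sketch, where label_files and the parsing are placeholders for however your targets are actually stored:

import torch

# Hypothetical sketch: collect only the class labels, without loading any images.
# `label_files` stands in for your list of label file paths; adapt the parsing
# to the actual format of your label files.
targets = []
for label_path in label_files:
    with open(label_path) as f:
        targets.append(int(f.read().split()[0]))  # assumes the class index is the first value
targets = torch.tensor(targets)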

Here is the code

ListDataset is a class

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
targets = []
for _, _, target in dataset:
    targets.append(target)
targets = torch.stack(targets)
print(targets)
print(type(targets))
sample_weights = weight[targets]
dataloader = torch.utils.data.DataLoader(
    dataset,
    sampler=WeightedRandomSampler(weights=sample_weights, num_samples=len(sample_weights), replacement=True),
    batch_size=opt.batch_size,
    shuffle=False,
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
)

Here is the ListDataset class:

class ListDataset(Dataset):
    def __init__(self, list_path, img_size=416, augment=True, multiscale=True, normalized_labels=True):
        with open(list_path, "r") as file:
            self.img_files = file.readlines()

        self.label_files = [
            path.replace("images", "labels").replace(".png", ".txt").replace(".jpg", ".txt")
            for path in self.img_files
        ]
        self.img_size = img_size
        self.max_objects = 100
        self.augment = augment
        self.multiscale = multiscale
        self.normalized_labels = normalized_labels
        self.min_size = self.img_size - 3 * 32
        self.max_size = self.img_size + 3 * 32
        self.batch_count = 0

    def __getitem__(self, index):

        # ---------
        #  Image
        # ---------

        img_path = self.img_files[index % len(self.img_files)].rstrip()

        # Extract image as PyTorch tensor
        img = transforms.ToTensor()(Image.open(img_path).convert('RGB'))

        # Handle images with less than three channels
        if len(img.shape) != 3:
            img = img.unsqueeze(0)
            img = img.expand((3, *img.shape[1:]))  # repeat the single channel three times

        _, h, w = img.shape
        h_factor, w_factor = (h, w) if self.normalized_labels else (1, 1)
        # Pad to square resolution
        img, pad = pad_to_square(img, 0)
        _, padded_h, padded_w = img.shape

        # ---------
        #  Label
        # ---------

        label_path = self.label_files[index % len(self.img_files)].rstrip()

        targets = None
        if os.path.exists(label_path):
            boxes = torch.from_numpy(np.loadtxt(label_path).reshape(-1, 5))
            # Extract coordinates for unpadded + unscaled image
            x1 = w_factor * (boxes[:, 1] - boxes[:, 3] / 2)
            y1 = h_factor * (boxes[:, 2] - boxes[:, 4] / 2)
            x2 = w_factor * (boxes[:, 1] + boxes[:, 3] / 2)
            y2 = h_factor * (boxes[:, 2] + boxes[:, 4] / 2)
            # Adjust for added padding
            x1 += pad[0]
            y1 += pad[2]
            x2 += pad[1]
            y2 += pad[3]
            # Returns (x, y, w, h)
            boxes[:, 1] = ((x1 + x2) / 2) / padded_w
            boxes[:, 2] = ((y1 + y2) / 2) / padded_h
            boxes[:, 3] *= w_factor / padded_w
            boxes[:, 4] *= h_factor / padded_h

            targets = torch.zeros((len(boxes), 6))
            targets[:, 1:] = boxes

        # Apply augmentations
        if self.augment:
            if np.random.random() < 0.5:
                img, targets = horisontal_flip(img, targets)

        return img_path, img, targets

    def collate_fn(self, batch):
        paths, imgs, targets = list(zip(*batch))

        # Remove empty placeholder targets
        targets = [boxes for boxes in targets if boxes is not None]
        # Add sample index to targets
        for i, boxes in enumerate(targets):
            boxes[:, 0] = i
        targets = torch.cat(targets, 0)
        # Selects new image size every tenth batch
        if self.multiscale and self.batch_count % 10 == 0:
            self.img_size = random.choice(range(self.min_size, self.max_size + 1, 32))
        # Resize images to input shape
        imgs = torch.stack([resize(img, self.img_size) for img in imgs])
        self.batch_count += 1

        return paths, imgs, targets

    def __len__(self):
        return len(self.img_files)

Hello,

Can someone give me a reference on how to implement a weighted random sampler for a text dataset?

Hi, I am trying the below on text data (NLP):

target_list = torch.tensor(train_data['label'])
target_list = target_list[torch.randperm(len(target_list))]
class_count = [i for i in target_list]
class_weights = 1./torch.tensor(class_count, dtype=torch.float)
class_weights_all = class_weights[target_list]
print(class_weights)

but the weights are not correct:

weight:[1., 1., 1., 1., 1., 1., inf, inf, 1., 1., 1., 1., inf, 1., 1., inf, 1., inf,
1., inf, 1., 1., 1., inf, 1., 1., 1., 1., 1., 1., 1., inf, 1., 1., 1., 1.,
1., 1., inf, 1., 1., 1., 1., 1., 1., 1., inf, 1., 1., inf, 1., inf, 1., 1.,
inf, 1., 1., 1., inf, 1., 1., 1., inf, 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., inf, inf, inf, inf, inf, 1., inf, 1.,

Based on your code snippet it seems you are assigning the class indices to class_count instead of the actual count, thus dividing by 0, which will create the Inf weights. My previous post has an executable code snippet as an example.
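
A minimal sketch of the count-based version, using made-up labels, where torch.bincount returns how often each class occurs:

import torch

target_list = torch.tensor([0, 0, 0, 0, 1, 1, 1, 2])  # hypothetical labels

# count how often each class occurs instead of storing the labels themselves
class_count = torch.bincount(target_list)        # tensor([4, 3, 1])
class_weights = 1. / class_count.float()         # tensor([0.2500, 0.3333, 1.0000])
class_weights_all = class_weights[target_list]   # one weight per sample
print(class_weights_all)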

target = target[torch.randperm(len(target))]

@ptrblck why do we shuffle the target here?

When I pass the sample_weight to a WeightedRandomSampler (with and without shuffling the target), I am unable to see any advantage of shuffling.

weighted_sampler = WeightedRandomSampler(
    weights=sample_weight,
    num_samples=len(sample_weight),
    replacement=True
)

print(list(weighted_sampler))

It’s an example to show that the target tensor does not need to contain the class indices in order.
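
A small sketch with two made-up classes split 90/10: each sample gets the weight of its class regardless of the ordering, so the sampler draws a roughly balanced set with or without the shuffle.

import torch
from torch.utils.data import WeightedRandomSampler

target = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
target = target[torch.randperm(len(target))]  # shuffling only changes the order, not the weights

class_weights = 1. / torch.bincount(target).float()
sample_weight = class_weights[target]

sampler = WeightedRandomSampler(sample_weight, num_samples=len(sample_weight), replacement=True)
idx = torch.tensor(list(sampler))
print(torch.bincount(target[idx]))  # roughly 50/50, with or without the shuffle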


I am using the WeightedRandomSampler with my imbalanced dataset. I have a total of 129 classes. But not all classes are included in the training. I have used a batch size of 128. Here is the code that I am using:

def make_weights_for_balanced_classes(images, nclasses):                        
    count = [0] * nclasses                                                      
    for item in images:                                                         
        count[item] += 1                                                     
    weight_per_class = [0.] * nclasses                                      
    N = float(sum(count))                                                   
    for i in range(nclasses):                                                   
        weight_per_class[i] = N/float(count[i])
    weight = [0] * len(images)                                              
    for idx, val in enumerate(images):                                          
        weight[idx] = weight_per_class[val]                                  
    return weight        

weights = make_weights_for_balanced_classes(bark_labels, NUMBER_OF_CLASSES)

weights = torch.DoubleTensor(weights)         
                              
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights), replacement=True)  
  
train_loader = torch.utils.data.DataLoader(dataset=train_data,batch_size=batch_size,shuffle=False, collate_fn=collate_fn, sampler = sampler)

@ptrblck
I continue to get the same error, probably because my raw dataset has too many class instances, crossing 2^24. Is this a limitation of torch.multinomial, which is used by the WeightedRandomSampler?

Yes, the limitation is explained in this post where you have also posted this question.

@ptrblck So if I understood correctly, the second tensor (i.e. idx=1) of the tensor dataset is always used as the default in WeightedRandomSampler? If there is a dataset that has multiple tensors and I wish to use the n-th tensor instead of the second one, is there an easy way to do it (instead of swapping the order of the tensors)?

I don’t fully understand this claim, as you are explicitly responsible for creating the sample weights for the WeightedRandomSampler. It won’t use the second tensor from some dataset by itself.
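
As a rough sketch of that point (the TensorDataset layout here is made up): the sampler only yields indices drawn according to the weights you pass in, so you can build those weights from whichever tensor holds the labels, e.g. the n-th one, without the sampler ever looking at the dataset.

import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

data = torch.randn(100, 3)
extra = torch.randn(100, 5)
labels = torch.randint(0, 2, (100,))          # hypothetical: the labels are the third tensor
dataset = TensorDataset(data, extra, labels)

# build the per-sample weights from the label tensor of your choice;
# the sampler never inspects the dataset, it only returns weighted indices
class_weights = 1. / torch.bincount(labels).float()
sample_weights = class_weights[labels]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)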