WeightedRandomSampler not sampling balanced batches

Hello there!

I’ve been trying to get a WeightedRandomSampler to work but somehow always end up not getting the expected output… Following some posts here on the PyTorch forums, this is what I have so far:

The dataset I am working with contains both the training and validation sets and is highly imbalanced: 0/1 -> 7153/1532.
First, to separate the data into the two sets, I run the following code:

validation_split = .1
dataset_size = len(labels)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

training_images = np.array(images_path)[train_indices]
training_labels = np.array(labels)[train_indices]
validation_images = np.array(images_path)[val_indices]
validation_labels = np.array(labels)[val_indices]

Here, labels is a list of all the labels in the dataset, and images_path is a list containing the path to each image, in the same order as the labels.
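
As a small aside (not from the original post), seeding NumPy's RNG right before the shuffle makes the split reproducible across runs:

import numpy as np

np.random.seed(42)  # hypothetical seed; any fixed int makes the shuffle, and thus the split, deterministic
np.random.shuffle(indices)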
Then I create the two distinct datasets:

data_transform = transforms.Compose([RandomCrop(size=IMG_SIZE-20, padding=(10,10)),
                                    Resize(IMG_SIZE),
                                    RandomRotation(20),
                                    ToTensor(),
                                    Normalize()])
training_dataset = CustomDataset(images_path=training_images, labels=training_labels, transform=data_transform)
validation_dataset = CustomDataset(images_path=validation_images, labels=validation_labels)

CustomDataset is defined as follows:

class CustomDataset(Dataset):
  def __init__(self, images_path, labels, transform=None):
    self.labels = labels
    self.images_path = images_path
    self.transform = transform

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    if torch.is_tensor(idx):
      idx = idx.tolist()
    img = Image.open(self.images_path[idx])
    label = self.labels[idx]
    if label == 1 and self.transform:
      # augment only the underrepresented class
      sample = {'image': img, 'label': label}
      sample = self.transform(sample)
    else:
      # majority class (and any sample without a transform): scale pixels to [-1, 1]
      img = torch.from_numpy(np.array(img))
      img = (img - 127.5) / 127.5
      sample = {'image': img, 'label': torch.tensor(label, dtype=torch.long)}
    return sample

Note how the transforms are only applied to the underrepresented class.
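
As a side note, since __getitem__ hands the transform a {'image', 'label'} dict, the entries in data_transform must be custom dict-aware wrappers rather than the stock torchvision transforms (torchvision's versions operate on the image alone, and its Normalize requires mean/std arguments). A minimal sketch of what such a wrapper might look like, assuming a preceding custom ToTensor step that keeps pixel values in [0, 255]:

import torch

class Normalize:
  # Dict-aware wrapper: scales the image to [-1, 1], matching the
  # (x - 127.5) / 127.5 branch above, and tensor-izes the label.
  def __call__(self, sample):
    img, label = sample['image'], sample['label']
    img = (img - 127.5) / 127.5
    return {'image': img, 'label': torch.tensor(label, dtype=torch.long)}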

My idea was then to use a DataLoader with a WeightedRandomSampler to get balanced batches on each iteration during training. This is how I tried to achieve that:

balance = torch.FloatTensor(balance)  # the class counts in reverse order, [1532, 7153], so each class gets the other's frequency as its weight
weights = balance / len(labels)
samples_weight = torch.tensor([weights[t] for t in training_labels])
weighted_sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

train_loader = DataLoader(training_dataset, batch_size=BATCH_SIZE, num_workers=0, sampler=weighted_sampler)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, num_workers=0)

weights contains: tensor([0.1764, 0.8236])
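
Those values are actually proportional to 1/class_count (1/7153 : 1/1532 ≈ 0.1764 : 0.8236), which is the standard recipe, so the weights themselves look sensible. For reference, a sketch (not from the original post) of building the per-sample weights directly from the class counts:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

class_counts = np.bincount(training_labels)  # [6576, 1241], per the totals printed further down
class_weights = 1.0 / class_counts           # the rarer class gets the larger weight
samples_weight = torch.as_tensor([class_weights[t] for t in training_labels],
                                 dtype=torch.double)
# replacement=True (the default) is what allows minority samples to be drawn repeatedly
weighted_sampler = WeightedRandomSampler(samples_weight, num_samples=len(samples_weight))

Note also that DataLoader raises an error if you pass both sampler and shuffle=True; the sampler already takes care of the shuffling.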

When I run the following code:

total_0 = 0
total_1 = 0
for batch in train_loader:
  total_0 += (batch['label'] == 0).sum()
  total_1 += (batch['label'] == 1).sum()
print(total_0, total_1)

This is the output I get:
tensor(6576) tensor(1241)
Shouldn’t the two be balanced? Each batch contains a distribution similar to: tensor(108) tensor(20).
I tried reversing the weights so that they read tensor([0.8236, 0.1764]) (just to see what would happen), and the output is exactly the same. Is the sampler not applying any weights at all?
This is my first attempt at learning PyTorch, so I might be missing something important (and hopefully easy to fix); most of these concepts are new to me. Any ideas on what to test or change will be extremely appreciated!

Thank you for your time! :evergreen_tree:

I am unable to pinpoint the problem exactly, but I can suggest a few directions.

  1. Use stratified splitting for imbalanced problems (scikit-learn has a function for this). It creates the train/test split with the same class ratios in both sets; see this, and the sketch after this list.

  2. In DataLoader, drop_last is another parameter to pay attention to, because it drops the trailing samples that do not fill a complete batch.
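
For instance, a minimal sketch of the stratified split with scikit-learn's train_test_split (variable names reused from the original post; the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

# stratify=labels keeps the 7153/1532 class ratio identical in both splits
training_images, validation_images, training_labels, validation_labels = train_test_split(
    images_path, labels, test_size=0.1, stratify=labels, random_state=42
)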

Thanks.

Hello!

Thank you for your answer!
I tried to follow your first direction, but from what I understand that approach would help me balance the examples between the train and test sets.
What I am trying to achieve is a well-balanced train set, without focusing on the test set at all.

My goal is to sample balanced batches from the training dataset (carrying out data augmentation only on the minority class, since those examples would theoretically get picked more often).
Per epoch, the idea is to see every minority example multiple times (with different transforms applied) and every majority example only once. Does that make sense?
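
For what it's worth, that is exactly what WeightedRandomSampler does with replacement=True (the default): minority indices get drawn several times per epoch, and each draw re-runs the random transforms. A toy sanity check of the mechanism (made-up numbers):

import torch
from torch.utils.data import WeightedRandomSampler

toy_weights = torch.tensor([0.1] * 9 + [0.9])  # 9 majority samples, 1 minority sample
sampler = WeightedRandomSampler(toy_weights, num_samples=10, replacement=True)
picked = list(sampler)
# Index 9 (the minority sample) is drawn ~5 times out of 10 on average,
# since its relative weight is 0.9 / (9 * 0.1 + 0.9) = 0.5
print(picked.count(9))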

As for the second direction you proposed, I don't see how it would help in this situation. Could you expand on that, please?

Again, thank you for your response!

Found the solution! Here it is, in case someone finds it useful in the future.

I was splitting the data into train and test sets incorrectly (due to a wrong interpretation of how NumPy arrays work), so I decided to change the approach and use list comprehensions:

training_images = [images_path[i] for i in train_indices]
training_labels = [labels[i] for i in train_indices]
validation_images = [images_path[i] for i in val_indices]
validation_labels = [labels[i] for i in val_indices]

And now I am getting the expected result from the WeightedRandomSampler: balanced batches on every iteration! :tada:
