Hello there!
I’ve been trying to get a WeightedRandomSampler to work but somehow allways end up not getting the expected output… Following some posts here on the PyTorch forums this is what I have so far:
The dataset I am working with contains both the training and validation sets and is highly unbalanced: 0/1 -> 7153/1532
First, to separete the data into two sets I run the following code:
validation_split = .1
dataset_size = len(labels)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]
training_images = np.array(images_path)[train_indices]
training_labels = np.array(labels)[train_indices]
validation_images = np.array(images_path)[val_indices]
validation_labels = np.array(labels)[val_indices]
Here, labels is a list of all the labels present in the dataset and images_path a list containing the path to each image in the same order as the labels.
Then I create the two distinct datasets:
data_transform = transforms.Compose([RandomCrop(size=IMG_SIZE-20, padding=(10,10)),
Resize(IMG_SIZE),
RandomRotation(20),
ToTensor(),
Normalize()])
training_dataset = CustomDataset(images_path=training_images, labels=training_labels, transform=data_transform)
validation_dataset = CustomDataset(images_path=validation_images, labels=validation_labels)
CustomDataset is definde as follows:
class CustomDataset(Dataset):
def __init__(self, images_path, labels, transform=None):
self.labels = labels
self.images_path = images_path
self.transform = transform
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
if torch.is_tensor(idx):
idx = idx.tolist()
img = Image.open(images_path[idx])
label = labels[idx]
if label == 1 and self.transform:
sample = {'image': img, 'label': label}
sample = self.transform(sample)
elif label == 0:
img = torch.from_numpy(np.array(img))
img = (img - 127.5) / 127.5
sample = {'image': img, 'label': torch.tensor(label, dtype=torch.long)}
return sample
Note how the transforms are only applied to the underepresented class.
And then my idea was to use a DataLoader with a WeightedRandomSampler to get balanced batches on each iteration during training. This is how I have tried to achieve this:
balance = torch.FloatTensor(balance)
weights = balance / len(labels)
samples_weight = torch.tensor([weights[t] for t in training_labels])
weighted_sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_loader = DataLoader(training_dataset, batch_size=BATCH_SIZE, num_workers=0, sampler=weighted_sampler)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, num_workers=0)
Weights contains: tensor([0.1764, 0.8236])
When I run the following code:
total_0 = 0
total_1 = 0
for batch in train_loader:
total_0 += (batch['label'] == 0).sum()
total_1 += (batch['label'] == 1).sum()
print(total_0, total_1)
This is the output I get:
tensor(6576) tensor(1241)
Shouldn’t the two be balanced? Each batch contains a distribution similar to: tensor(108) tensor(20)
I have tried to reverse the weights so that it returns tensor([0.8236, 0.1764]) (just to see what would happen) and the output is exactly the same. Is the sampler not applying any weights at all?
This is my first attempt at learning about PyTorch so I might be missing something important (and hopefully easy to fix), most of this concepts are new to me. Any ideas on what to test or change will be extremly appreciated!
Thank you for your time!