Balanced trainLoader

I am loading a custom dataset and I’ve read the post here:

I am using the same code as follows:

import numpy as np
import torch
from import DataLoader, TensorDataset, WeightedRandomSampler

# weight each sample by the inverse frequency of its class
class_sample_count = np.array([len(np.where(y_train == t)[0]) for t in np.unique(y_train)])
weight = 1. / class_sample_count
samples_weight = np.array([weight[t] for t in y_train])

samples_weight = torch.from_numpy(samples_weight)
sampler = WeightedRandomSampler(samples_weight.type('torch.DoubleTensor'), len(samples_weight))

mb_size = 13
trainDataset = TensorDataset(torch.FloatTensor(x_train), torch.FloatTensor(y_train.astype(int)))

trainLoader = DataLoader(dataset=trainDataset, batch_size=mb_size, num_workers=1, sampler=sampler)

However, when I iterate for a few epochs, some batches contain targets that are all from the same class (all 0s or all 1s), which raises an error in my triplet loss function. What is the best way to handle this issue?
Is there a way to force the trainLoader to always load samples from both classes?


If you really need samples from both classes in every batch, I would write a custom Dataset and return a tuple containing one sample from each class.
The WeightedRandomSampler draws samples according to probabilities, so some batches might randomly get more samples of one class than the other (or even only one class).
Let me know if you get stuck with your Dataset.
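As a back-of-the-envelope check (assuming independent draws with replacement, which is the sampler's default, and perfectly balanced class probabilities), the chance of a single-class batch is small but nonzero, so it will eventually show up over many epochs:

```python
# Probability that a batch of 13 independent draws with p = 0.5 per class
# is entirely one class: 2 * 0.5**13 (either all zeros or all ones).
mb_size = 13
p_single_class_batch = 2 * 0.5 ** mb_size
print(p_single_class_batch)  # 0.000244140625
```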

Thanks a lot! I’ll try to create a new Dataset as you suggested.