How to Prevent Overfitting

First off, you wouldn’t shuffle your testloader.

Here’s an example:

import torch
import torch.utils.data as data_utils

batch_size = 20
class_sample_count = [10, 1, 20, 3, 4]  # dataset has 10 class-1 samples, 1 class-2 sample, etc.
weights = 1 / torch.Tensor(class_sample_count)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, batch_size)
# sampler is mutually exclusive with shuffle, so don't pass shuffle=True here
trainloader = data_utils.DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

Awesome, thanks!

Why should you not shuffle your testloader?

Hi @smth, how can the weighted sampler be set up in the trainloader automatically?
I mean, we could count class_sample_count in the init step of the dataset,
for example in the __init__ function of ImageFolder(data.Dataset).

At test time, random shuffling does not affect model performance, so there is no need to do it.
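
For example, a minimal sketch (assuming a test_dataset built the same way as the train_dataset above):

testloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # deterministic order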

@smth
Hmm, I tried the code above, but got the following error:

Command:

batch_size = 20
class_sample_count = [10, 5, 2, 1] 
weights = (1 / torch.Tensor(class_sample_count))
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, batch_size)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/sampler.py", line 81, in __init__
    self.weights = torch.DoubleTensor(weights)
TypeError: torch.DoubleTensor constructor received an invalid combination of arguments - got (torch.FloatTensor), but expected one of:
 * no arguments
 * (int ...)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleTensor viewed_tensor)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.Size size)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleStorage data)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (Sequence data)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)

Before the sampler = ... line, add: weights = weights.double()
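
That is, the failing snippet above becomes:

batch_size = 20
class_sample_count = [10, 5, 2, 1]
weights = 1 / torch.Tensor(class_sample_count)
weights = weights.double()  # the sampler builds a DoubleTensor from the weights
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, batch_size)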


@smth
OK, I added that and tried running the code, but I am getting odd, erratic results for Top-1 accuracy on the test set and train set.

Test-set Top-1 accuracy tops out at around 20%, while train-set Top-1 accuracy hits 100%, which is bizarre.

It also does not seem to train on all of the data in each epoch; it progresses to the next epoch more quickly than expected.

When I take out the WeightedRandomSampler, I get normal results again.

Also, is it correct to use the trainloader with the sampler when computing the train-set accuracy?

(Error seen on PyTorch 0.1.10.)

You should strongly consider data augmentation in some meaningful way. If you’re attempting to do classification, think about which augmentations might add useful information and help distinguish the classes in your dataset. In one of my cases, introducing background variation increased the recognition rate by over 50%. Basically, with small datasets there is too much overfitting, so you want the network to learn real-world distinctions rather than irrelevant artifacts like backgrounds, shadows, etc.

Alternatively, as Andrew Ng said: “get more data” :) <-- seems to be the cure for everything.
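
To make that concrete, here is a generic augmentation sketch (not my project’s pipeline; the path and the specific transforms are placeholders, using a recent torchvision API):

from torchvision import transforms
from torchvision.datasets import ImageFolder

# Pick augmentations that mimic nuisance variation (backgrounds, lighting,
# viewpoint) rather than the features that actually distinguish the classes.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

train_dataset = ImageFolder('path/to/train', transform=train_transform)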


(Error also seen on PyTorch 0.1.11.)

Yeah, from what I understand, you will get better accuracy if your class sizes are close to equal. I think varying data augmentation could help, or culling from the largest classes, or using synthetic data to fill in some of the smaller classes.

Shouldn’t the second argument of the sampler be the total number of samples in the training set? When I set it to batch_size, it only runs for one batch every epoch. @smth


@wangg12
@smth
I also had this issue; it seemed to only run for one batch every epoch. Did you find a way to correct this?

@nikmentenson I don’t know the correct way; the doc is too hard to understand. At the least, the number of samples drawn should be the number of examples in the whole dataset.

Perhaps @apaszke can help?

Yes, I could not understand the documentation either.

I think the correct way to use WeightedRandomSampler in your case is to initialize the weights per sample, such that:

prob = [0.7, 0.2, 0.1]  # frequency of class 0 = 0.7, of class 1 = 0.2, etc.
# labels[i] = class label of the sample at index i in the dataset
reciprocal_weights = []
for index in range(len(dataset)):
    reciprocal_weights.append(prob[labels[index]])

weights = 1 / torch.Tensor(reciprocal_weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(dataset))

I went through the sampler.WeightedRandomSampler source, and it simply returns an iterator that draws indices from a multinomial distribution over len(weights) entries. Therefore, to sample the entire dataset in one epoch while weighting each sample inversely to the frequency of its class, the weights should be as long as the dataset, with each index holding the weight for the class of the sample at that index.
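
Putting this together, here is one way to build the per-sample weights directly from the labels (a sketch; targets is assumed to hold one class label per sample, in dataset order):

import torch
from collections import Counter

class_count = Counter(targets)  # how often each class appears
weights = torch.DoubleTensor([1.0 / class_count[t] for t in targets])

# Draw len(targets) indices per epoch, i.e. one full pass in expectation.
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(targets))
trainloader = torch.utils.data.DataLoader(dataset, batch_size=20, sampler=sampler)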


I met a similar problem. It seems something is wrong in WeightedRandomSampler.

One more question: WeightedRandomSampler seems similar to the weight parameter in nn.CrossEntropyLoss. Which one do you suggest using? @smth

Thanks!

Can you share some code showing how you do background variation for images?

Thanks so much!

Unfortunately I can’t, as it is pretty specific to my project. But a good way to approach it would be to use OpenCV or something similar, since it has a ton of image-manipulation algorithms.
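
As a rough, generic sketch of the idea (all names here are placeholders; it assumes you already have a foreground mask, which is the project-specific part):

import cv2
import numpy as np

def vary_background(image, mask, background):
    # image, background: HxWx3 uint8 arrays; mask: HxW uint8, 255 = foreground
    background = cv2.resize(background, (image.shape[1], image.shape[0]))
    alpha = cv2.merge([mask, mask, mask]).astype(np.float32) / 255.0
    blended = image * alpha + background * (1.0 - alpha)
    return blended.astype(np.uint8)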

I also encountered the same problem as @wangg12: using the above code results in the training iterating over a single batch, @smth. The docs are also not clear on how to use WeightedRandomSampler with DataLoader.

@Chahrazad All samplers are used in a consistent way.

You first create a sampler object. For example, let’s say you have 10 samples in your Dataset:

dataset_length = 10
epoch_length = 100  # each epoch sees 100 draws of samples
# weights must be nonnegative, so use rand (uniform) rather than randn here
sample_weights = torch.rand(dataset_length)
weighted_sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, epoch_length)
torch.utils.data.DataLoader(..., sampler=weighted_sampler)
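
As discussed above, the second argument is the number of indices drawn per epoch, so pass len(dataset) there if you want each epoch to cover the dataset once in expectation.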