Right, I switched from a ResNet-50 pretrained on ImageNet to a ResNet-18, and that lowered the overfitting, so my trainset Top-1 accuracy is now around 58% (down from 69%).
Would increasing the data augmentation rate (from 16x to, say, 96x) decrease overfitting further?
One other thing about the nature of my dataset: there is a severe class-size imbalance across the 110 classes. Could this be contributing to the overfitting as well? What would be a good solution? Perhaps clustering the smaller classes into aggregate classes to match the size of the larger classes, and then using a second net to break the aggregate classes back down into their original classes for more fine-grained classification?
Lastly, what would you say is a reasonable amount of overfitting in terms of the accuracy gap between trainset and testset? 5%?
Hi @smth, how can we use a weighted sampler in the trainloader automatically?
I mean, we could compute **class_sample_count** in the init step of the dataset, for example in the `__init__` function of `ImageFolder(data.Dataset)`.
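A minimal sketch of what that could look like, assuming a torchvision `ImageFolder` where `dataset.imgs` holds `(path, class_index)` pairs; the helper name `make_sample_weights` and the path are just for illustration:

```python
import torch
from torchvision import datasets

def make_sample_weights(dataset):
    # count how many samples each class has
    labels = [label for _, label in dataset.imgs]
    class_sample_count = [0] * len(dataset.classes)
    for label in labels:
        class_sample_count[label] += 1
    # weight every sample by the inverse frequency of its class
    weights = [1.0 / class_sample_count[label] for label in labels]
    return torch.DoubleTensor(weights)

dataset = datasets.ImageFolder('path/to/train')
weights = make_sample_weights(dataset)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
```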
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/sampler.py", line 81, in __init__
    self.weights = torch.DoubleTensor(weights)
TypeError: torch.DoubleTensor constructor received an invalid combination of arguments - got (torch.FloatTensor), but expected one of:
 * no arguments
 * (int ...)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleTensor viewed_tensor)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.Size size)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleStorage data)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (Sequence data)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
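The traceback shows the sampler building a `torch.DoubleTensor` directly from the weights, so a `FloatTensor` is rejected. Casting the weights to double (or passing a plain Python list, which matches the `Sequence` overload) should get past this. A minimal sketch with illustrative values:

```python
import torch
from torch.utils.data.sampler import WeightedRandomSampler

weights = torch.FloatTensor([0.5, 1.0, 2.0])  # per-sample weights (illustrative)
sampler = WeightedRandomSampler(weights.double(), len(weights))
# alternatively: WeightedRandomSampler(weights.tolist(), len(weights))
```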
You should strongly consider data augmentation in some meaningful way. If you’re attempting classification, think about which augmentations might add useful information and help distinguish the classes in your dataset. In one of my cases, introducing background variation increased the recognition rate by over 50%. Basically, with small datasets there is too much overfitting, so you want the network to learn real-world distinctions rather than irrelevant artifacts like backgrounds, shadows, etc.
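As a rough illustration only (these particular transforms and parameters are assumptions, not what was actually used above), a torchvision augmentation pipeline along those lines might look like:

```python
from torchvision import transforms

# illustrative augmentation pipeline; tune the choices and parameters to your data
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # vary framing and background
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3), # vary lighting / shadows
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```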
Alternatively, as Andrew Ng said: “get more data” <-- seems to be the cure to everything.
Yeah, from what I understand, you will get better accuracy if your class sizes are close to equal. Varying the data augmentation could help, as could culling from the largest classes or filling in some of the smaller classes with synthetic data.
Shouldn’t the second argument of the sampler be the total number of samples in the training set? When I set it to batch_size, it only runs for one batch every epoch. @smth
@nikmentenson I don’t know the correct way either; the doc is too hard to understand. At the very least, num_samples should be the total number of examples in the dataset.
I think the correct way to use WeightedRandomSampler in your case is to initialize the weights like this:
import torch

prob = [0.7, 0.3, 0.1]  # probability of class 0 = 0.7, of class 1 = 0.3, etc.
# labels[i] = class index of the sample at position i in the dataset
reciprocal_weights = []
for index in range(len(dataset)):
    reciprocal_weights.append(prob[labels[index]])
weights = (1 / torch.DoubleTensor(reciprocal_weights))
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(dataset))
I went through the sampler.WeightedRandomSampler source, and it simply returns an iterator that draws indices from a multinomial distribution over len(weights) categories. Therefore, to sample the entire dataset in one epoch while weighting each sample inversely to the frequency of its class, the weights should be as long as the dataset itself, with each index carrying the weight of the class of the sample at that index.
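For completeness, a minimal sketch of plugging such a sampler into a DataLoader (the batch size here is arbitrary; note that passing a sampler is mutually exclusive with shuffle=True):

```python
from torch.utils.data import DataLoader

# the sampler replaces shuffling; one epoch draws len(dataset) weighted samples
train_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```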