Let me start by saying I've searched, and searched, and then searched some more, not only on the PyTorch forums but also on GitHub and other sources.
I believe what confused me is that I blindly copied samples from this forum.
I have read several related topics and posts on this forum.
There seem to be three issues I would really appreciate clarity on:
- Does the `WeightedRandomSampler` take as argument the length of the training set, or the batch size? The behaviour changes dramatically if I (wrongly, I presume) pass the batch size.
- How exactly should I calculate the weights for the `WeightedRandomSampler`? I've seen snippets from other users and from @ptrblck, and they all vary to the point where I think I have calculated them wrongly (see the snippet below, which is the actual code I use).
- If I use a `CrossEntropyLoss` criterion, does the way it is back-propagated change? I have seen a snippet here (again from @ptrblck) which suggests that when using a `WeightedRandomSampler` with `CrossEntropyLoss`, the backprop should be different? Furthermore, in the same snippet, the `nn.CrossEntropyLoss` object seems to take the actual weights as a parameter?
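For reference, here is the weighted-CE variant as I understand it from those snippets. The two-class counts below are made up for illustration, and I'm not certain this interpretation is right, hence the question:

```python
import torch
import torch.nn as nn

# Hypothetical per-class counts for a two-class imbalanced dataset
class_counts = torch.tensor([4000., 40.])
class_weights = 1.0 / class_counts  # inverse-frequency class weights

# nn.CrossEntropyLoss takes *per-class* weights (one entry per class),
# not per-sample weights
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2, requires_grad=True)  # dummy model output
labels = torch.randint(0, 2, (8,))              # dummy targets
loss = criterion(logits, labels)
loss.backward()  # backprop call is unchanged; the weights only rescale the loss
```

As far as I can tell, nothing about the `loss.backward()` call itself changes; the `weight` argument just scales each sample's contribution by its class weight.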
The way I calculate the actual weights: the method `class_samples()` returns a list where each class/label has its rows in the dataset counted. Order is preserved, and I have manually verified this.
Most examples I have seen stop at line #3 and don't carry on with the remaining lines.
EDIT: I tried removing the bottom three lines (`torch.cat` and such) and the network loss drops to zero but Top-1 gets stuck at 50%, so I am assuming keeping them is the correct approach to creating the sample weights.
```python
sample_index = dataset.class_samples()
num_samples = len(dataset)
classes_weight = 1. / torch.tensor(sample_index, dtype=torch.float)
target = torch.cat((torch.zeros(int(num_samples * 0.99), dtype=torch.long),
                    torch.ones(int(num_samples * 0.01), dtype=torch.long)))
samples_weight = torch.tensor([classes_weight[t] for t in target])
```
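For comparison, here is what I suspect the computation should look like if the 99%/1% `target` split (which I think I copied from a synthetic two-class example) were replaced by the dataset's actual labels. The counts below are made-up stand-ins for `dataset.class_samples()` and the real per-sample labels:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical stand-ins for dataset.class_samples() and the real labels
sample_index = [4000, 40]  # rows per class, label order preserved
target = torch.cat((torch.zeros(4000, dtype=torch.long),
                    torch.ones(40, dtype=torch.long)))  # the actual labels

classes_weight = 1. / torch.tensor(sample_index, dtype=torch.float)
samples_weight = classes_weight[target]  # one weight per sample

# num_samples = number of draws per epoch, i.e. len(dataset), not batch size
sampler = WeightedRandomSampler(samples_weight,
                                num_samples=len(samples_weight),
                                replacement=True)
```

If that is right, the sampler wants one weight per sample (so its length equals the dataset length), which would answer my first bullet as well.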
The remaining code is more or less templated:
- I use a ResNet34 for single-class classification of PDF forms (uploaded by users), which are very well structured most of the time.
- I have a highly unbalanced dataset; some classes have 4000 samples, some only have 40, hence the use of the `WeightedRandomSampler`.
- I get very good Top-1 accuracy which is reproducible (99% to 100%); however, when doing cross-evaluation, the SoftMax'ed output finds and approximates similarity between known inputs when it really shouldn't. There is high variability (more than 30%) in Top-1 when testing with unseen input. I would assume this is too much generalisation/approximation, whilst the network reports over-fitting, which makes no sense to me.
Any help would be greatly appreciated, and hopefully would put the matter to rest.