The problem is:
I have the LFW dataset, which contains faces of about 2000 people, that is to say about 2000 classes, with roughly 2 images per person.
I use triplet loss to train the net, but the problem is:
- Do we need to include almost all the classes in a mini-batch to train a good model? (I'm not sure.) That is to say at least 4000 images per mini-batch. Is that possible in practice? I typically train with a mini-batch size of 64 or 32, so this seems weird.
That is quite impossible, even for datasets with fewer targets/classes (like the 1000 of ImageNet). Also, quite a few people advocate for not-too-large batch sizes like the 32 or 64 you mentioned. Not only does such a batch fit in the memory of most consumer GPUs now, but it also tends to generalize better. I'll let you read On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima for more details!
So in practice you would keep those 32/64 batch sizes, but for that kind of problem (most classification problems, really), you'll probably want to maximize the number of identities in a mini-batch, or even guarantee that there are no duplicate identities within a batch.
Some ideas to get you started:
- You can represent your dataset as one list of samples per identity, then, at each epoch, shuffle both those lists and their order, and draw samples by iterating over the list of identities and over each identity's sample list.
- If the dataset does not fit in memory (RAM, not GPU memory), this fully random access will be very slow, and so will training. One option is to shuffle the dataset as above, but offline. It won't be truly random every epoch, but full randomness is often impractical with large, complex datasets.
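As a rough illustration of the first idea, here is a minimal sketch of an identity-balanced batch sampler in plain Python. The `samples_by_identity` dict, the toy dataset, and the batch size are made up for the example:

```python
import random

def identity_batches(samples_by_identity, batch_size):
    """Yield batches of (identity, sample) pairs with no duplicate
    identity inside a batch. `samples_by_identity` maps each identity
    to the list of its samples (e.g. image paths)."""
    # Shuffle each identity's sample list, and the identity order itself.
    pools = {ident: random.sample(samples, len(samples))
             for ident, samples in samples_by_identity.items()}
    identities = list(pools)
    random.shuffle(identities)

    batch = []
    for ident in identities:
        batch.append((ident, pools[ident].pop()))  # one sample per identity
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch

# Hypothetical toy dataset: 6 identities, 2 images each.
dataset = {f"person_{i}": [f"img_{i}_{j}.jpg" for j in range(2)]
           for i in range(6)}
for batch in identity_batches(dataset, batch_size=4):
    print([ident for ident, _ in batch])  # all identities in a batch differ
```

Note that for triplet loss specifically you would usually relax the no-duplicate rule and draw at least two samples per identity, so that each batch contains positive pairs to build triplets from; the sketch above is the strict no-duplicate variant.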
You mean that for each epoch we get a newly shuffled dataset and train on that, which is in some way equivalent to augmenting the data?
But there is still a problem that confuses me:
the total number of identities is very large, say 4000, while a mini-batch can only contain, say, 32 identities. Is that OK?
Maybe the gradient updates will be too unstable to even converge, since each mini-batch updates on a different group of identities and may push in a totally different direction?
For example, if the first mini-batch contains the classes ("car", "airplane") and the second contains ("dog", "cat"), there is no reason that two mini-batches with totally different groups will converge to the same point.
It’s not really a “new dataset”, just the ordering is different! It might be related to data augmentation, in the sense that the mini-batches received by the network are different each time, even though the images themselves don’t change.
Indeed, the gradients won't be the same and the convergence won't be the same, but that's the magic behind stochastic gradient descent! Training on the whole dataset at once isn't realistic with today's deep networks and memory constraints, but neither is online training with only one sample per iteration, because the gradients would differ too much between successive iterations. Mini-batch SGD is a good trade-off and has proven its performance for quite some time now.
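To make that trade-off concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem. The data, learning rate, and batch size are arbitrary choices for the example; the point is that each batch sees a different random subset, yet the noisy updates still converge:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # 1000 samples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
batch_size, lr = 32, 0.1
for epoch in range(20):
    order = rng.permutation(len(X))     # new shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Gradient of the mean squared error on this mini-batch only.
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad                  # noisy step, but unbiased on average

print(np.linalg.norm(w - true_w))       # small: w ends up close to true_w
```

Each individual step points in a slightly "wrong" direction (it only sees 32 of the 1000 samples), yet on average the steps point downhill, which is exactly why mini-batch SGD works.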
Does this mean that a network will be harder to train on a task with a large number of classes, e.g. 4000, than on a task with only 10 classes, since the variance of the SGD updates may be much larger than with few classes?
Very probably, yes, but I suspect it’s not the only reason. I don’t know if the comparison is relevant, but even for a human, distinguishing 10 different things is way easier than 4000, after some learning…
Thank you for your kind help!