Why do models for fine-grained image recognition overfit on general image recognition datasets?

Models designed for fine-grained image recognition, such as Bilinear-CNN,
overfit on general image recognition datasets like CIFAR-10. Why does this happen? Is it a matter of model capacity, or is something else going on?
Thanks in advance.
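
To make the capacity part of the question concrete, here is a minimal pure-Python sketch of bilinear pooling as I understand it. It is not the reference Bilinear-CNN implementation. The names `bilinear_pool` and `bilinear_feature_dim` are made up for illustration, and the 512-channel figure assumes two VGG-style feature streams.

```python
import math


def bilinear_pool(feat_a, feat_b):
    """Pool two feature maps into one bilinear descriptor.

    feat_a: list of C1 channels, each a list of N spatial activations.
    feat_b: list of C2 channels, each a list of N spatial activations.
    Returns a flat list of length C1 * C2.
    """
    n = len(feat_a[0])
    # Outer product of the two streams, averaged over the N spatial locations.
    pooled = [
        sum(a[i] * b[i] for i in range(n)) / n
        for a in feat_a
        for b in feat_b
    ]
    # Signed square root, then L2 normalization.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def bilinear_feature_dim(c1, c2):
    """Length of the descriptor produced by bilinear_pool."""
    return c1 * c2


# Toy input: 3 channels per stream, 4 spatial locations.
stream_a = [[1.0, 0.5, 0.0, 2.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
stream_b = [[1.0, 1.0, 0.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
descriptor = bilinear_pool(stream_a, stream_b)
print(len(descriptor))

# Assumed setting: two 512-channel streams and a 10-class linear classifier.
dim = bilinear_feature_dim(512, 512)
print(dim, dim * 10)
```

Under that assumption the descriptor has 512 × 512 = 262,144 dimensions, so a 10-class linear layer on top of it already has about 2.6 million weights, against only 50,000 training images in CIFAR-10. I am not sure whether this alone accounts for the overfitting.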