I would recommend trying out your architecture on a broader set of datasets (e.g. ImageNet) and checking the performance, since the mentioned dataset might be too small and limited to support the claim of a generally superior model architecture.
Hi @ptrblck, that was my first thought. But to train on datasets such as ImageNet and JFT, I have neither that kind of storage nor the computing power to process them, nor the money to pay for resources on AWS or GCP.
I’m an undergrad student who’s staying at home due to the lockdown and using Google Colab for my work. I cannot afford AWS or GCP.
That’s understandable.
If you don’t have the computing resources, you could take a look at the model analytically and perform some experiments in that direction. E.g., the Capsule Network paper used MNIST, CIFAR10, smallNORB, and SVHN for its initial experiments, but added theoretical reasoning for why the architecture works.
Maybe you could try a similar approach.
Just to add to @ptrblck’s comments – I’m coming from a university research perspective, although not in image/video processing.
Deep Learning is both data- and resource-hungry. This is why on many top papers, at least one author is from a company like Microsoft, Facebook, etc. That means being smart only gets you so far in many cases. AI is also a very crowded field, and CVPR is extremely(!) difficult to get into. And the works that do get in come from very strong groups that also have the required resources.
I’m really not trying to question your capabilities, but you might want to be a bit more pragmatic with your goals. For example, the paper that @ptrblck linked is from Hinton’s group, and Hinton won the Turing Award for his work in Deep Learning.
By the way, according to my image/video colleagues, a “cat vs. dog” classifier should reach at least around 99% these days :).