I have a large architecture that uses a few MLPs of the form linear -> ReLU -> linear. After 30 epochs it reaches a loss of around 0.5. But if I replace all the MLPs with single linear layers, it reaches a loss of around 0.06 in the same number of epochs. I am trying to fit a small test dataset.
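To make the comparison concrete, here is a minimal self-contained sketch of the two variants trained head-to-head on a tiny dataset. All dimensions, the hidden size, and the learning rate are placeholders I picked for illustration, not my actual setup, and I've used plain NumPy rather than my framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic dataset standing in for the small test set (placeholder shapes).
X = rng.normal(size=(32, 8))
true_W = rng.normal(size=(8, 4))
Y = X @ true_W  # a linearly generated target

def train_linear(X, Y, steps=500, lr=0.1):
    """Single linear layer trained with full-batch gradient descent on MSE."""
    W = rng.normal(scale=0.1, size=(X.shape[1], Y.shape[1]))
    for _ in range(steps):
        pred = X @ W
        grad = 2 * X.T @ (pred - Y) / len(X)
        W -= lr * grad
    return np.mean((X @ W - Y) ** 2)

def train_mlp(X, Y, hidden=16, steps=500, lr=0.1):
    """linear -> ReLU -> linear trained the same way, with manual backprop."""
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, Y.shape[1]))
    for _ in range(steps):
        H = X @ W1
        A = np.maximum(H, 0.0)           # ReLU
        pred = A @ W2
        d_pred = 2 * (pred - Y) / len(X)
        gW2 = A.T @ d_pred               # gradient w.r.t. second layer
        dH = (d_pred @ W2.T) * (H > 0)   # backprop through ReLU
        gW1 = X.T @ dH                   # gradient w.r.t. first layer
        W1 -= lr * gW1
        W2 -= lr * gW2
    H = np.maximum(X @ W1, 0.0)
    return np.mean((H @ W2 - Y) ** 2)

print("linear loss:", train_linear(X, Y))
print("mlp loss:   ", train_mlp(X, Y))
```

Running something like this reproduces the flavor of what I'm seeing: both models train, but they end up at noticeably different losses after the same number of steps.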
Is this an issue? And why does it happen?