Should an MLP learn slower than a single linear layer?

I have a large architecture that uses several MLP blocks of the form linear -> ReLU -> linear. After 30 epochs it reaches a loss of around 0.5. But if I replace all of the MLP blocks with single linear layers, it reaches a loss of around 0.06 in the same number of epochs. I am trying to fit a small test dataset (a minimal sketch of the comparison is below).
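Here is a stripped-down version of what I mean, assuming PyTorch. The dimensions, data, optimizer, and learning rate are placeholders I picked for illustration, not my actual setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes and random data, just to make the comparison concrete.
in_dim, hidden_dim, out_dim, n = 16, 32, 4, 64
x = torch.randn(n, in_dim)
y = torch.randn(n, out_dim)

# Variant A: the MLP block described above (linear -> ReLU -> linear).
mlp = nn.Sequential(
    nn.Linear(in_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, out_dim),
)

# Variant B: the same block replaced by a single linear layer.
lin = nn.Linear(in_dim, out_dim)

def fit(model, epochs=30):
    """Fit the model to (x, y) with MSE and return the final loss."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

print("MLP loss:   ", fit(mlp))
print("Linear loss:", fit(lin))
```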

Is this an issue? And why does it happen?