How to use residual learning applied to fully connected networks?

Here is my result of applying batch norm:

image

I think I’ll try SGD instead of Adam, but after that I’m at a loss as to what else to experiment with. I’m just getting much better and consistent results with a vanilla feed forward network. I only wish I knew if that was on my implementation or on the data.