Loss at the first epoch is very high but back to normal from the second epoch onward

This is one related discussion I could find. In it I describe a use case where the original targets were in [0, 96], and where scaling the target range to [0, 1] during training (and unscaling predictions back to the original range at inference) worked better than trying to learn the original target range directly.
I'm sure a proper weight and bias initialization would also work, but this was a quick way to let the model learn.
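The scaling trick above can be sketched as a pair of helpers (a minimal sketch with hypothetical names; `TARGET_MAX = 96` comes from the target range mentioned above):

```python
TARGET_MAX = 96.0  # upper bound of the original target range

def scale(targets):
    """Map targets from [0, 96] to [0, 1] before computing the loss."""
    return [t / TARGET_MAX for t in targets]

def unscale(outputs):
    """Map model outputs from [0, 1] back to the original [0, 96] range."""
    return [o * TARGET_MAX for o in outputs]

# During training, the loss is computed against scaled targets:
y = [0.0, 48.0, 96.0]
y_scaled = scale(y)        # [0.0, 0.5, 1.0]

# At prediction time, model outputs are unscaled to the original units:
preds = unscale(y_scaled)  # [0.0, 48.0, 96.0]
```

The same round trip applies element-wise to tensors (e.g. dividing and multiplying a `torch.Tensor` by `TARGET_MAX`); the point is simply that the model only ever sees targets in [0, 1], which keeps the initial loss in a sane range.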