I’m training a deep neural network with N hidden layers, but I found that both train and test accuracy get worse as N grows (i.e., as I add more hidden layers).
As I understand it, when a neural network becomes deeper, its performance may degrade due to vanishing/exploding gradients: layers close to the input can’t update their weights stably because the gradients that reach them are very small/large.
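For context, my mental model of vanishing gradients is just repeated multiplication of per-layer derivative factors. A minimal sketch in plain Python (assuming sigmoid activations; the depth and the use of sigmoid'(0) as an upper bound are illustrative):

```python
import math

def sigmoid_prime(x):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# In backprop, the gradient reaching layer k is (roughly) a product of
# one derivative factor per layer above it. With sigmoid activations
# each factor is at most sigmoid'(0) = 0.25, so the product shrinks
# exponentially with depth.
depth = 20
factor = sigmoid_prime(0.0)   # 0.25, the largest possible value
grad_scale = factor ** depth
print(f"upper bound on gradient scale after {depth} layers: {grad_scale:.3e}")
# → roughly 9.095e-13
```

So with 20 sigmoid layers, even the best case loses about 12 orders of magnitude, which is why the layers nearest the input are the ones expected to stall.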
But after checking each layer’s gradients, I found that the gradients in the deeper layers are smaller than in the shallow layers, e.g. about 10^-5 in the first hidden layer and 10^-10 in the last hidden layer, which is the opposite of what I expected.
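To show the kind of check I mean, here is a sketch of logging per-layer gradient norms after one backward pass, assuming a PyTorch model (the toy MLP, loss, and random data below are made up for illustration, not my actual setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy deep MLP with sigmoid activations; architecture is illustrative only.
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers.append(nn.Linear(32, 1))
model = nn.Sequential(*layers)

# One forward/backward pass on random data to populate .grad.
x = torch.randn(64, 32)
y = torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Print the gradient norm of each weight matrix, input side first.
for name, p in model.named_parameters():
    if p.dim() == 2:  # weight matrices only, skip biases
        print(f"{name}: grad norm = {p.grad.norm().item():.3e}")
```

This is roughly how I collected the numbers above: one batch, one backward pass, then the L2 norm of each layer’s weight gradient.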
Is there any misunderstanding in my reasoning? Or is there some other reason why my model gets worse as it goes deeper? Thanks.