Why do neural networks with more layers perform better than a single layer?

Why do neural networks with more layers perform better than a single layer MLP with a number of neurons that leads to the same number of parameters?

I read this post:

https://www.quora.com/Why-do-neural-networks-with-more-layers-perform-better-than-a-single-layer-MLP-with-a-number-of-neurons-that-leads-to-the-same-number-of-parameters

and I’m still not sure it’s always true.

For example:

Assume we have a model with 2 linear layers (for simplicity, with no bias) and ReLU as the final layer. The model looks like:

   RELU (w1x + w2x)

and the parameters to optimize are w1 and w2 (two parameters).

We can see that:

  RELU(w1x + w2x) = RELU((w1 + w2)x) = RELU(w3x)

so w3 = w1+w2

i.e. the second model, with a single linear layer, has fewer parameters to optimize.
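
For example, a quick numeric check of this collapse (just a sketch with made-up scalar values for x, w1, and w2):

    import torch

    x = torch.tensor(2.5)
    w1, w2 = torch.tensor(0.7), torch.tensor(0.5)
    w3 = w1 + w2  # the single collapsed weight

    a = torch.relu(w1 * x + w2 * x)
    b = torch.relu(w3 * x)
    print(torch.allclose(a, b))  # True: both reduce to relu(w3 * x)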

  1. In my example, is it still better to use one layer or 2 layers?
  2. Am I right that the second model (with w3) has fewer parameters?
  3. Is it easier to optimize the second model (w3)?

I’m not sure how to interpret your “model” definition:

RELU (w1x + w2x)

as it seems you are trying to sum the outputs of both layers, which is unusual.
Assuming you want to use two consecutive linear layers with a final ReLU, your model would be defined as:

out = relu(w2 @ (w1 @ x))

In this case you are right: the two linear layers can be collapsed into a single one, since no activation function is used between them.
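
As a minimal sketch (the shapes and random values below are just assumptions for illustration), the two stacked linear layers produce the same output as a single layer with weight w2 @ w1:

    import torch

    torch.manual_seed(0)
    w1 = torch.randn(8, 4)   # first linear layer: 4 -> 8 features
    w2 = torch.randn(3, 8)   # second linear layer: 8 -> 3 features
    x = torch.randn(4)

    out_two = torch.relu(w2 @ (w1 @ x))   # two linear layers, relu only at the end
    out_one = torch.relu((w2 @ w1) @ x)   # single collapsed layer with weight w2 @ w1
    print(torch.allclose(out_two, out_one, atol=1e-5))  # True (up to float error)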
However, collapsing the layers is not possible if you use an activation function between them, as is the common approach:

out = relu(w2 @ relu(w1 @ x))
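
Here the intermediate relu breaks the equivalence, so the two weight matrices can no longer be merged into one (same assumed shapes as in the sketch above):

    import torch

    torch.manual_seed(0)
    w1 = torch.randn(8, 4)
    w2 = torch.randn(3, 8)
    x = torch.randn(4)

    out_nonlinear = torch.relu(w2 @ torch.relu(w1 @ x))  # relu between the layers
    out_collapsed = torch.relu((w2 @ w1) @ x)             # attempted single-layer collapse
    print(torch.allclose(out_nonlinear, out_collapsed))   # almost surely False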