Why do we calculate the square root of the MSE, since minimizing MSE is the same as minimizing RMSE? Is it because of numerical stability or something? Or to avoid exploding gradients, which can result from bigger loss-function values?

Hello Serdar!

First a comment: I would lean towards using the mean-squared-error (MSE), as it is a “more natural” measure of error (whatever that might mean).

(Just to be clear, “RMSE” is an acronym for “root-mean-squared-error,” and is equal to `sqrt (MSE)`.)

Now for some concrete technical differences:

Consider a single variable, `x`, and minimizing `x**2` with respect to `x` using gradient descent. Note that `sqrt (x**2) = abs (x)`. `x**2` is the one-dimensional version of MSE, and `abs (x) = sqrt (x**2)` is the one-dimensional version of RMSE.
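(As a quick numerical check of this correspondence, here is a minimal sketch, assuming PyTorch; the sample values are just illustrative:)

```python
import torch

# One-dimensional analogues: sqrt (x**2) equals abs (x).
x = torch.linspace(-2.0, 2.0, 5)    # tensor([-2., -1.,  0.,  1.,  2.])
print(x**2)                          # 1-d "MSE":  tensor([4., 1., 0., 1., 4.])
print(torch.sqrt(x**2))              # 1-d "RMSE": tensor([2., 1., 0., 1., 2.])
print(x.abs())                       # same values as the line above
```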

Both `x**2` and `abs (x)` are minimized when `x = 0` (at which point both equal zero). The gradient of `x**2` is “softer” in that it gets smaller (and approaches zero) as `x` gets closer to 0. In contrast, the gradient of `abs (x)` is always either `+1` or `-1`, and doesn’t change in magnitude as `x` approaches zero.
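You can see this directly with autograd (again a minimal PyTorch sketch; the sample points are arbitrary):

```python
import torch

# Gradients of the two 1-d losses at a few sample points:
# d (x**2) / dx = 2 * x (shrinks towards zero), while
# d abs (x) / dx = sign (x) (always +1 or -1 away from zero).
for val in (4.0, 1.0, 0.1, 0.01):
    x = torch.tensor(val, requires_grad=True)
    (x**2).backward()
    grad_mse = x.grad.item()

    x = torch.tensor(val, requires_grad=True)
    x.abs().backward()
    grad_rmse = x.grad.item()

    print(f"x = {val:5.2f}:  grad of x**2 = {grad_mse:5.2f},  grad of abs (x) = {grad_rmse:+.0f}")
```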

When `x` is large (greater than 1/2), `x**2` will have the larger gradient, and, using gradient descent, will drive you towards the minimum at zero more rapidly. But when `x` is small, `abs (x)` will have the larger gradient.

Unless you expect `x` to start out very large, you might expect minimization of `abs (x)` to proceed more rapidly because its gradient doesn’t get smaller. On the other hand, because the magnitude of its gradient stays the same, once near the `x = 0` minimum, you might expect gradient descent to jump back and forth from positive `x` to negative `x` back to positive `x`, and so on, without making further progress towards `x = 0`.
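Here is a small sketch of that bouncing behavior (plain gradient descent in PyTorch; the starting point and learning rate are arbitrary choices):

```python
import torch

# Plain gradient descent on abs (x) with a fixed learning rate: once
# abs (x) drops below the step size, the iterate overshoots zero and
# bounces between +0.05 and -0.05 forever.
lr = 0.1
x = torch.tensor(0.25, requires_grad=True)
for step in range(6):
    x.abs().backward()
    with torch.no_grad():
        x -= lr * x.grad
    x.grad.zero_()
    print(f"step {step}: x = {x.item():+.2f}")
# prints x = +0.15, +0.05, -0.05, +0.05, -0.05, +0.05
```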

So … Pick your poison.

(All of these effects can be addressed to some degree by using variants of plain-vanilla gradient descent, such as adding momentum, using an optimizer such as `Adam`, and/or using a learning-rate scheduler.)
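For illustration, a minimal sketch of one such fix, using `Adam` together with an exponentially-decaying learning-rate scheduler (the particular optimizer, `gamma`, and step count are placeholder choices):

```python
import torch

# Same abs (x) objective, now with Adam plus a decaying learning rate.
# The shrinking step size bounds how far the iterate can bounce around
# the x = 0 minimum.
x = torch.tensor(0.25, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)
for _ in range(100):
    opt.zero_grad()
    x.abs().backward()
    opt.step()
    sched.step()
print(f"final x = {x.item():+.5f}")
```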

Of course, the realistic case of using either MSE or RMSE as the loss function to be applied to the output of a complicated network is much more involved, but, at some level, the above comments still apply.
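As a rough illustration of what that looks like in practice, here is a toy-regression sketch (the model, data, and the `rmse_loss` helper are all made up for illustration):

```python
import torch
import torch.nn as nn

# Toy regression trained with RMSE; everything here (model, data,
# rmse_loss) is a made-up example.
torch.manual_seed(0)
model = nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
mse = nn.MSELoss()

def rmse_loss(pred, target):
    # The 1 / (2 * sqrt (MSE)) chain-rule factor grows as the MSE
    # shrinks, reproducing the abs (x)-style gradient near the minimum.
    # The small eps keeps sqrt's gradient finite if the MSE hits zero.
    return torch.sqrt(mse(pred, target) + 1.0e-12)

inputs = torch.randn(32, 3)
targets = torch.randn(32, 1)
for _ in range(10):
    opt.zero_grad()
    loss = rmse_loss(model(inputs), targets)  # or: mse (model (inputs), targets)
    loss.backward()
    opt.step()
```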

Best.

K. Frank

That is an outstanding answer!

- The RMSE indicates the noise level in the same units as the target, i.e., on the scale of a standard deviation.
- The MSE has nice mathematical properties for fast calculations (its gradient is linear in the error and propagates easily).