# Why do we use RMSE instead of MSE?

Why do we calculate the square root of the MSE, since minimizing MSE is the same as minimizing RMSE? Is it because of numerical stability or something? Or to avoid exploding gradients, which can result from larger loss-function values?

Hello Serdar!

First a comment: I would lean towards using the mean-squared-error
(MSE), as it is a “more natural” measure of error (whatever that might
mean).

(Just to be clear, “RMSE” is an acronym for “root-mean-squared-error,”
and is equal to `sqrt (MSE)`.)

Now for some concrete technical differences:

Consider a single variable, `x`, and minimizing `x**2` with respect to
`x` using gradient descent. Note that `sqrt (x**2) = abs (x)`. `x**2`
is the one-dimensional version of MSE, and `abs (x) = sqrt (x**2)`
is the one-dimensional version of RMSE.
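As a quick sanity check of that identity, here is a minimal sketch (I'm using PyTorch here, since the thread mentions `Adam`; the sample values are arbitrary):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
mse_1d = x**2                # one-dimensional MSE
rmse_1d = torch.sqrt(x**2)   # one-dimensional RMSE
print(torch.allclose(rmse_1d, x.abs()))   # True: sqrt (x**2) == abs (x)
```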

Both `x**2` and `abs (x)` are minimized when `x = 0` (at which point
both equal zero). The gradient of `x**2` is “softer” in that it gets
smaller (and approaches zero) as `x` gets closer to 0. In contrast,
the gradient of `abs (x)` is always either `+1` or `-1`, and doesn’t
change in magnitude as `x` approaches zero.
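You can see this directly with autograd (a small sketch; the two evaluation points are just illustrative):

```python
import torch

for val in (2.0, 0.1):
    x = torch.tensor(val, requires_grad=True)
    (x**2).backward()
    print(f"d(x**2)/dx   at x={val}: {x.grad.item()}")  # 2*x, shrinks toward 0

    x = torch.tensor(val, requires_grad=True)
    x.abs().backward()
    print(f"d(abs(x))/dx at x={val}: {x.grad.item()}")  # always +1 or -1
```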

When `x` is large (greater than 1/2), `x**2` will have the larger gradient,
and, using gradient descent, drive you towards the minimum at zero
more rapidly. But when `x` is small, `abs (x)` will have the larger gradient.

Unless you expect `x` to start out very large, you might expect
minimization of `abs (x)` to proceed more rapidly because its
gradient doesn’t get smaller. On the other hand, because the
magnitude of its gradient stays the same, once near the `x = 0`
minimum, you might expect gradient descent to jump back and
forth from positive `x` to negative `x` and back again, and so on,
without making further progress towards `x = 0`.
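Here is a toy gradient-descent loop that illustrates both behaviors (a sketch; the starting point and learning rate are arbitrary):

```python
import torch

def descend(loss_fn, x0=2.0, lr=0.3, steps=8):
    x = torch.tensor(x0, requires_grad=True)
    trace = []
    for _ in range(steps):
        loss_fn(x).backward()
        with torch.no_grad():
            x -= lr * x.grad   # plain gradient-descent update
        x.grad.zero_()
        trace.append(round(x.item(), 3))
    return trace

print("x**2  :", descend(lambda x: x**2))     # shrinks smoothly toward 0
print("abs(x):", descend(lambda x: x.abs()))  # fixed-size steps, then bounces
```

With these settings `x**2` converges geometrically, while `abs (x)` marches down in steps of size `lr` and then oscillates back and forth around zero.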

(All of these effects can be addressed to some degree by using
an optimizer such as `Adam`, and/or using a learning-rate
scheduler.)
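For example, a minimal sketch of the `Adam`-plus-scheduler combination on the one-dimensional problem (the hyperparameters are just placeholders):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for _ in range(100):
    optimizer.zero_grad()
    x.abs().backward()   # the constant-magnitude-gradient (RMSE-like) case
    optimizer.step()
    scheduler.step()

print(x.item())  # roughly 0: the decaying step size damps the bouncing
```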

Of course, the realistic case of using either MSE or RMSE as the
loss function to be applied to the output of a complicated network
is much more involved, but, at some level, the above comments still
apply.
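In practice, if you do want RMSE as a training loss, it is typically just the square root of the MSE loss (a sketch with a random stand-in for a network’s output; the small `eps` is a common guard, since the gradient of `sqrt` blows up at exactly zero):

```python
import torch
import torch.nn.functional as F

pred = torch.randn(8, 1, requires_grad=True)   # stand-in for a network's output
target = torch.randn(8, 1)

mse = F.mse_loss(pred, target)
rmse = torch.sqrt(mse + 1e-8)   # eps guards the sqrt's gradient at mse == 0
rmse.backward()                 # gradients flow through the sqrt as usual
```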

Best.

K. Frank

That is an outstanding answer!

• The RMSE is in the same units as the target, so it indicates the noise level in the scale of standard deviations (see the sketch below).
• The RMSE has nice mathematical properties for fast calculation (its gradient is simple to compute and propagates easily).
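A quick numeric illustration of the first point (a sketch with synthetic data; the noise level of 0.5 is arbitrary):

```python
import torch

target = torch.randn(10_000)
pred = target + 0.5 * torch.randn_like(target)  # predictions off by noise of std 0.5

rmse = torch.sqrt(torch.mean((pred - target) ** 2))
print(rmse.item())  # roughly 0.5: RMSE reads as the noise standard deviation
```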