# Why do we use RMSE instead of MSE?

Why do we calculate the square root of the MSE, since minimizing MSE is the same as minimizing RMSE? Is it because of numerical stability or something? Or to avoid exploding gradients, which can result from larger loss-function values?

Hello Serdar!

First a comment: I would lean towards using the mean-squared-error
(MSE), as it is a “more natural” measure of error (whatever that might
mean).

(Just to be clear, “RMSE” is an acronym for “root-mean-squared-error,”
and is equal to `sqrt (MSE)`.)

Now for some concrete technical differences:

Consider a single variable, `x`, and minimizing `x**2` with respect to
`x` using gradient descent. Note that `sqrt (x**2) = abs (x)`. `x**2`
is the one-dimensional version of MSE, and `abs (x) = sqrt (x**2)`
is the one-dimensional version of RMSE.
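As a quick sanity check of that identity, here is a minimal sketch (I'm using PyTorch here, since the thread mentions `Adam`; the sample values are arbitrary):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
mse_1d = x**2                # one-dimensional MSE
rmse_1d = torch.sqrt(x**2)   # one-dimensional RMSE
print(torch.allclose(rmse_1d, x.abs()))   # True: sqrt (x**2) == abs (x)
```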

Both `x**2` and `abs (x)` are minimized when `x = 0` (at which point
both equal zero). The gradient of `x**2` is “softer” in that it gets
smaller (and approaches zero) as `x` gets closer to 0. In contrast,
the gradient of `abs (x)` is always either `+1` or `-1`, and doesn’t
change in magnitude as `x` approaches zero.
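You can see this directly with autograd (a small sketch; the two evaluation points are just illustrative):

```python
import torch

for val in (2.0, 0.1):
    x = torch.tensor(val, requires_grad=True)
    (x**2).backward()
    print(f"d(x**2)/dx   at x={val}: {x.grad.item()}")  # 2*x, shrinks toward 0

    x = torch.tensor(val, requires_grad=True)
    x.abs().backward()
    print(f"d(abs(x))/dx at x={val}: {x.grad.item()}")  # always +1 or -1
```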

When `x` is large (greater than 1/2), `x**2` will have the larger gradient,
and, using gradient descent, drive you towards the minimum at zero
more rapidly. But when `x` is small, `abs (x)` will have the larger gradient.

Unless you expect `x` to start out very large, you might expect
minimization of `abs (x)` to proceed more rapidly because its
gradient doesn’t get smaller. On the other hand, because the
magnitude of its gradient stays the same, once near the `x = 0`
minimum, you might expect gradient descent to jump back and
forth from positive `x` to negative `x` and back again, and so on,
without making further progress towards `x = 0`.
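Here is a toy gradient-descent loop that illustrates both behaviors (a sketch; the starting point and learning rate are arbitrary):

```python
import torch

def descend(loss_fn, x0=2.0, lr=0.3, steps=8):
    x = torch.tensor(x0, requires_grad=True)
    trace = []
    for _ in range(steps):
        loss_fn(x).backward()
        with torch.no_grad():
            x -= lr * x.grad   # plain gradient-descent update
        x.grad.zero_()
        trace.append(round(x.item(), 3))
    return trace

print("x**2  :", descend(lambda x: x**2))     # shrinks smoothly toward 0
print("abs(x):", descend(lambda x: x.abs()))  # fixed-size steps, then bounces
```

With these settings `x**2` converges geometrically, while `abs (x)` marches down in steps of size `lr` and then oscillates back and forth around zero.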

(All of these effects can be addressed to some degree by using
an optimizer such as `Adam`, and/or using a learning-rate
scheduler.)
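For example, a minimal sketch of the `Adam`-plus-scheduler combination on the one-dimensional problem (the hyperparameters are just placeholders):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for _ in range(100):
    optimizer.zero_grad()
    x.abs().backward()   # the constant-magnitude-gradient (RMSE-like) case
    optimizer.step()
    scheduler.step()

print(x.item())  # roughly 0: the decaying step size damps the bouncing
```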

Of course, the realistic case of using either MSE or RMSE as the
loss function to be applied to the output of a complicated network
is much more involved, but, at some level, the above comments still
apply.
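In practice, if you do want RMSE as a training loss, it is typically just the square root of the MSE loss (a sketch with a random stand-in for a network’s output; the small `eps` is a common guard, since the gradient of `sqrt` blows up at exactly zero):

```python
import torch
import torch.nn.functional as F

pred = torch.randn(8, 1, requires_grad=True)   # stand-in for a network's output
target = torch.randn(8, 1)

mse = F.mse_loss(pred, target)
rmse = torch.sqrt(mse + 1e-8)   # eps guards the sqrt's gradient at mse == 0
rmse.backward()                 # gradients flow through the sqrt as usual
```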

Best.

K. Frank

That is an outstanding answer!

• The RMSE is in the same units as the target, so it indicates the noise level in the scale of standard deviations (see the sketch below).
• The RMSE has nice mathematical properties for fast calculation (its gradient is simple to compute and propagates easily).
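A quick numeric illustration of the first point (a sketch with synthetic data; the noise level of 0.5 is arbitrary):

```python
import torch

target = torch.randn(10_000)
pred = target + 0.5 * torch.randn_like(target)  # predictions off by noise of std 0.5

rmse = torch.sqrt(torch.mean((pred - target) ** 2))
print(rmse.item())  # roughly 0.5: RMSE reads as the noise standard deviation
```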