Hello Serdar!
First a comment: I would lean towards using the mean-squared-error (MSE), as it is a “more natural” measure of error (whatever that might mean). (Just to be clear, “RMSE” is an acronym for “root-mean-squared-error,” and is equal to sqrt (MSE).)
Now for some concrete technical differences:

Consider a single variable, x, and minimizing x**2 with respect to x using gradient descent. Note that sqrt (x**2) = abs (x). x**2 is the one-dimensional version of MSE, and abs (x) = sqrt (x**2) is the one-dimensional version of RMSE.
Both x**2 and abs (x) are minimized when x = 0 (at which point both equal zero). The gradient of x**2 is “softer” in that it gets smaller (and approaches zero) as x gets closer to 0. In contrast, the gradient of abs (x) is always either +1 or -1, and doesn’t change in magnitude as x approaches zero.
When x is large (greater than 1/2), x**2 will have the larger gradient, and, using gradient descent, drive you towards the minimum at zero more rapidly. But when x is small, abs (x) will have the larger gradient.
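To make this concrete, here is a minimal autograd sketch (the two sample values of x are made up purely for illustration):

```python
import torch

# compare the gradients of x**2 (one-dimensional MSE) and
# abs (x) (one-dimensional RMSE) at a large and a small x
for x0 in (3.0, 0.01):
    x = torch.tensor(x0, requires_grad=True)
    (x**2).backward()
    grad_sq = x.grad.item()    # gradient of x**2 is 2 * x
    x.grad = None              # clear before the second backward pass
    x.abs().backward()
    grad_abs = x.grad.item()   # gradient of abs (x) is sign (x)
    print(f"x = {x0:5.2f}:  grad of x**2 = {grad_sq:5.2f},  grad of abs (x) = {grad_abs:5.2f}")
```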
Unless you expect x to start out very large, you might expect minimization of abs (x) to proceed more rapidly because its gradient doesn’t get smaller. On the other hand, because the magnitude of its gradient stays the same, once near the x = 0 minimum, you might expect gradient descent to jump back and forth from positive x to negative x back to positive x, and so on, without making further progress towards x = 0.
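Here is a quick sketch of that behavior with plain-vanilla gradient descent (the starting point and learning rate are made up):

```python
import torch

lr = 0.1  # made-up learning rate

# minimize abs (x) with plain gradient descent -- note the oscillation
x = torch.tensor(0.25, requires_grad=True)
for step in range(6):
    x.grad = None
    x.abs().backward()
    with torch.no_grad():
        x -= lr * x.grad
    print(f"abs (x) step {step}: x = {x.item():+.3f}")
# x ends up jumping between +0.05 and -0.05 without further progress

# minimize x**2 the same way -- x shrinks smoothly towards zero
x = torch.tensor(0.25, requires_grad=True)
for step in range(6):
    x.grad = None
    (x**2).backward()
    with torch.no_grad():
        x -= lr * x.grad
    print(f"x**2 step {step}: x = {x.item():+.3f}")
```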
So … Pick your poison.
(All of these effects can be addressed to some degree by using variants of plain-vanilla gradient descent, such as adding momentum, using an optimizer such as Adam, and/or using a learning-rate scheduler.)
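As one possible sketch of such a variant (the tiny linear model, random data, and all hyperparameters below are placeholders, not recommendations):

```python
import torch

# placeholder model and data -- substitute your own network and batches
model = torch.nn.Linear(10, 1)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

loss_fn = torch.nn.MSELoss()

# Adam adapts per-parameter step sizes, softening both the vanishing
# gradient of MSE near the minimum and the constant-magnitude gradient
# of RMSE
optimizer = torch.optim.Adam(model.parameters(), lr=1.e-3)

# a scheduler shrinks the learning rate over time, damping the
# back-and-forth jumping described above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```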
Of course, the realistic case of using either MSE or RMSE as the loss function to be applied to the output of a complicated network is much more involved, but, at some level, the above comments still apply.
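In pytorch, one common sketch for that case is to take the square root of MSELoss. Note that the small eps below is my own addition (not part of the discussion above), hedging against the infinite gradient of sqrt if the loss ever reaches exactly zero:

```python
import torch

mse_fn = torch.nn.MSELoss()

def rmse_loss(pred, target, eps=1.e-8):
    # RMSE is just sqrt (MSE); eps keeps sqrt's gradient
    # finite when MSE is exactly zero
    return torch.sqrt(mse_fn(pred, target) + eps)

pred = torch.randn(8, 4, requires_grad=True)
target = torch.randn(8, 4)
rmse_loss(pred, target).backward()   # gradients flow back through the sqrt
```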
Best.
K. Frank