Hi all,
I would like to use the RMSE loss instead of MSE. From what I saw in the PyTorch documentation, there is no built-in function. Any ideas how this could be implemented?
Wouldn’t it work if you just call torch.sqrt() on the output of nn.MSELoss?
import torch
import torch.nn as nn

# random prediction and target, just to demonstrate
x = torch.randn(5, 10, requires_grad=True)
y = torch.randn(5, 10)

criterion = nn.MSELoss()
loss = torch.sqrt(criterion(x, y))  # RMSE = sqrt(MSE)
loss.backward()
print(x.grad)
@ptrblck’s solution is the best one, I think (because it is the simplest).
Just for fun, you can also do the following:
# create a function (this is my favorite choice)
def RMSELoss(yhat, y):
    return torch.sqrt(torch.mean((yhat - y) ** 2))

criterion = RMSELoss
loss = criterion(yhat, y)
# create a nn class (just-for-fun choice :-)
class RMSELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, yhat, y):
        return torch.sqrt(self.mse(yhat, y))

criterion = RMSELoss()
loss = criterion(yhat, y)
You should be careful with NaN, which will appear if mse=0. Something like this would probably be better:
class RMSELoss(nn.Module):
    def __init__(self, eps=1e-6):
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps

    def forward(self, yhat, y):
        loss = torch.sqrt(self.mse(yhat, y) + self.eps)
        return loss
The sqrt of 0 is 0, not NaN:
>>> torch.sqrt(torch.zeros(1))
tensor([0.])
Of course, the issue is during the backward pass, where you end up multiplying 0 by infinity (the derivative of sqrt at 0).
>>> mse = nn.MSELoss()
>>> yhat = torch.zeros(1, requires_grad=True)
>>> y = torch.zeros(1)
>>> loss = torch.sqrt(mse(yhat,y))
>>> loss.backward()
>>> yhat.grad
tensor([nan])
Using the simple module I wrote above:
>>> rmse = RMSELoss()
>>> yhat = torch.zeros(1, requires_grad=True)
>>> y = torch.zeros(1)
>>> loss = rmse(yhat,y)
>>> loss.backward()
>>> yhat.grad
tensor([0.])
Hi, I wonder if that’s exactly the same as RMSE when dealing with a batch size greater than 1.
i.e. target and prediction are [2, 0, 256, 256] tensors
MSE_0 = MSE(prediction[0,:,:,:], target[0,:,:,:])
MSE_1 = MSE(prediction[1,:,:,:], target[1,:,:,:])
The RMSE we want is:
SQRT(MSE_0) + SQRT(MSE_1)
while torch.sqrt(nn.MSELoss()(x, y)) will give:
SQRT(MSE_0 + MSE_1)
so:
sqrt(M1 + M2) is not equal to sqrt(M1) + sqrt(M2)
Even with reduction turned off, what we want is:
Mean[ Mean(sqrt(MSE_0)) + Mean(sqrt(MSE_1)) ]
whereas what we get with reduction='mean', I think, is:
sqrt( Mean(MSE_0) + Mean(MSE_1) )
so:
[sqrt(M1)/N + sqrt(M2)/N] / 2 is not equal to sqrt(M1/N + M2/N)
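A quick numerical check of this difference (the shapes below are only illustrative, not taken from the example above):

import torch
import torch.nn as nn

prediction = torch.randn(2, 3, 8, 8)
target = torch.randn(2, 3, 8, 8)
mse = nn.MSELoss()

# sqrt of the MSE computed over the whole batch
rmse_of_batch_mse = torch.sqrt(mse(prediction, target))

# per-sample RMSE, then averaged over the batch
mean_of_per_sample_rmse = torch.stack(
    [torch.sqrt(mse(prediction[i], target[i])) for i in range(prediction.shape[0])]
).mean()

print(rmse_of_batch_mse, mean_of_per_sample_rmse)  # generally not equal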
please correct me if my understanding is wrong. Thanks 
Try adding an eps, such as eps = 1e-8, depending on your precision.
This implementation follows the definition of the RMSE error:
class RMSELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.eps = 1e-6

    def forward(self, ground_truth, prediction):
        # per-sample RMSE over the last axis, averaged over the batch;
        # eps is placed inside the sqrt so the gradient is not NaN at zero error
        loss = torch.mean(torch.sqrt(torch.sum(torch.square(ground_truth - prediction), axis=-1) + self.eps))
        return loss
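A quick usage sketch of this class (the shapes and variable names are only illustrative):

criterion = RMSELoss()

ground_truth = torch.randn(8, 5)                      # batch of 8 samples, 5 outputs each
prediction = torch.randn(8, 5, requires_grad=True)

loss = criterion(ground_truth, prediction)
loss.backward()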
So, in summary, one should follow ptrblck’s implementation and simply take the square root of the mean squared error. However, you need to be aware of the NaN that can result in the backward pass.
Hence it is very important to follow YannDubs1’s advice and add a very small non-zero number like 1e-6, or even smaller like 1e-8. This barely affects the learned weights, since the added constant is negligible compared to typical loss values.
So the simplest solution is to add the following to ptrblck’s implementation:
eps = 1e-6
criterion = nn.MSELoss()
loss = torch.sqrt(criterion(x, y) + eps)
Correct this Noob if wrong!
You should be warned that minimising RMSE, as @windson correctly assessed, is NOT the same as minimising MSE, and that one should always use MSE as the loss function, because RMSE is just the square root of the MSE computed over batches, which doesn’t make much sense if batches are randomized, as in SGD.
If someone wanted to compute RMSE over single data points instead, that would just be the absolute error (as long as the output is a scalar, because no averaging is actually performed), and therefore MAE is recommended.
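A tiny check of that last point (the numbers are arbitrary):

import torch

yhat = torch.tensor([2.5])
y = torch.tensor([4.0])

rmse_single = torch.sqrt(torch.mean((yhat - y) ** 2))  # tensor(1.5000)
mae_single = torch.mean(torch.abs(yhat - y))            # tensor(1.5000)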
Btw, I found this thread while searching for a method to safely record and log RMSE instead of MSE on the test set.
If you are using Lightning, as I do, that method exists and is torchmetrics.regression.MeanSquaredError(squared=False). This can be used for training too, as the gradients will still be correctly computed on the MSE. This class will, however, accumulate the computed MSE and present the square root when logging it.
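A minimal sketch of how that metric can also be used outside of Lightning (assuming torchmetrics is installed):

import torch
from torchmetrics.regression import MeanSquaredError

# squared=False makes the metric return the RMSE instead of the MSE
rmse_metric = MeanSquaredError(squared=False)

preds = torch.randn(10)
target = torch.randn(10)

rmse_metric.update(preds, target)
print(rmse_metric.compute())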