# Is tanh to ensure regression output is in [-1,1] a good idea?

I want to regress some coordinates from images, and these coordinates were normalized to [-1, 1] in my data set. When I just use a linear layer at the end of my network, I very often get NaN as the output of my L1 or MSE loss right in the first epoch. I guess this is caused by exploding gradients or something similar. It happens with both learning rate 0.1 and 1.0. I am using SGD with momentum=0.9 and nesterov=True.

When I use tanh instead of just a linear layer at the end, this does not seem to happen. However, I am unsure whether this is an optimal solution, because if you look at the tanh graph, the output is much more sensitive to input changes close to x = 0 than to changes at x > 1 or x < -1. So if I want output values close to 1 or -1, the input to tanh has to be very large. I hope I have explained my doubts well enough.
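To put a rough number on that saturation concern (my own quick check, not from the original post): the derivative of tanh is 1 - tanh(x)^2, so the gradient collapses quickly once the input leaves the region around 0.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

print(tanh_grad(0.0))  # 1.0 (maximum sensitivity)
print(tanh_grad(3.0))  # ~0.0099, roughly 100x smaller

# To output 0.999, the pre-activation already has to reach atanh(0.999):
print(math.atanh(0.999))  # ~3.8
```

So targets near the boundary sit deep in the flat region, which is exactly the sensitivity problem described above.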

What are recommended methods to ensure that the regression output is always in a specific range, in my case [-1,1]?

Additional information: Before the regression layer in the end I have a lot of convolutions and deconvolutions, but I think that this should not be the problem here.


The learning rates don't mean anything to me, and 3e-4 or so isn't uncommon either. But if you can deal with coordinates outside [-1, 1], one option would be to use 1.5*tanh: you'd bound yourself to a reasonable range while not saturating.
Jeremy Howard's fast.ai does something like that for predicting MovieLens ratings in his collaborative filtering lecture (or at least has done so in some version of the course).

Best regards

Thomas

No, I don't think the range being `[-1, 1]` alone makes `Tanh` a good choice. The reason is that the distribution of your outputs will be shifted towards the two ends, so most of your outputs will be close to -1 and +1, while very few values will be close to 0. So, assuming your target values are uniformly distributed within `[-1, 1]`, `Tanh` will not be able to generate outputs that match the target distribution.

Thanks for your answer. Unfortunately, limiting to [-1,1] is required. Values outside of this range would mean that my network detects points outside of the image which makes no sense.

What do you mean by "The learning rates don't mean anything to me"? That I should reduce the learning rate?

Yes, that is what I was thinking. What would you use instead of tanh?

All examples that I can find on the internet just use a linear layer for the regression. But when I do this, I have absolutely no guarantee about the range of output values.

You can search for bounded regression to solve that. Bounded regression is an area of research in itself, and there are some proposed models for it in the literature. This answer on Stack Exchange may also be useful: https://stats.stackexchange.com/questions/11985/how-to-model-bounded-target-variable

Thanks! Another idea I had was to add a custom loss term that is 0 when the values are in [-1, 1] and grows very fast when they are outside that range. This would still not guarantee that my model predicts valid values in every case, but it would make it less likely.
Then I could just add a postprocessing step like this: `predicted_value = max(-1, min(1, predicted_value))`
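A sketch of that penalty-plus-clamp idea (function names are mine; in PyTorch you would add the penalty term to the L1/MSE loss and clamp with `torch.clamp` at inference time):

```python
def range_penalty(y, lo=-1.0, hi=1.0):
    """Zero inside [lo, hi], growing quadratically outside."""
    if y < lo:
        return (lo - y) ** 2
    if y > hi:
        return (y - hi) ** 2
    return 0.0

def clamp(y, lo=-1.0, hi=1.0):
    """Postprocessing: hard-limit the prediction to [lo, hi]."""
    return max(lo, min(hi, y))

print(range_penalty(0.5))  # 0.0 (in range, no penalty)
print(range_penalty(1.3))  # ~0.09, i.e. (1.3 - 1.0)^2
print(clamp(1.3))          # 1.0
```

Note the quadratic penalty is differentiable at the boundary, so it won't introduce kinks into the gradient at exactly -1 and 1.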

I will give it a try.

For everybody who reads this and is looking for a good solution: try "Differentiable Spatial to Numerical Transform" (DSNT).

This outperforms both the tanh and linear solution by far in my case. And it is really easy to use and adds no learnable parameters to your model.
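For anyone curious what DSNT does under the hood, here is a rough pure-Python sketch of the core idea (1-D for brevity; the actual method by Nibali et al. operates on 2-D heatmaps): normalize the network's heatmap into a probability distribution, then take its expected value over a fixed coordinate grid spanning [-1, 1]. The result is bounded by construction, fully differentiable, and adds no learnable parameters.

```python
import math

def dsnt_1d(heatmap):
    """Expected coordinate in [-1, 1] from unnormalized heatmap scores."""
    n = len(heatmap)
    # Fixed coordinate grid spanning [-1, 1].
    coords = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    # Softmax turns the scores into a probability distribution.
    m = max(heatmap)
    exps = [math.exp(h - m) for h in heatmap]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected value: a weighted average of grid points, so it always
    # lies inside [-1, 1].
    return sum(p * c for p, c in zip(probs, coords))

print(dsnt_1d([0.0, 0.0, 5.0, 0.0, 0.0]))  # 0.0 (peak at the center)
print(dsnt_1d([0.0, 0.0, 0.0, 0.0, 5.0]))  # close to 1.0
```

Because the output is a weighted average of coordinates that all lie in [-1, 1], it can never leave that range, which directly solves the bounded-output problem discussed above.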
