I want to regress some coordinates from images, and these coordinates were normalized to [-1,1] in my data set. When I just use a linear layer at the end of my network, I very often get NaN as the output of my L1 or MSE loss right in the first epoch. I guess this is because of exploding gradients or something similar. This happens with learning rates of both 0.1 and 1.0. I am using SGD with momentum=0.9 and nesterov=True.

When I use tanh instead of just a linear layer in the end, this does not seem to happen. However, I am unsure if this is an optimal solution because when you look at the tanh graph, it is much more sensitive to changes that are close to x=0 than to changes at x > 1 or x < -1. So if I have output values close to 1 or -1, the input to tanh has to be very big. I hope I have explained my doubts well enough.
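To put numbers on that concern, here is a small sketch in plain Python (no framework needed) that computes which input tanh needs in order to produce a given output, and how large the local gradient is at that point. The helper names `required_input` and `tanh_grad` are made up for illustration.

```python
import math

def required_input(y):
    """Pre-activation x such that tanh(x) == y, valid for -1 < y < 1."""
    return math.atanh(y)

def tanh_grad(x):
    """Derivative of tanh at pre-activation x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Targets moving from the center of the range towards the edge.
for target in (0.0, 0.5, 0.9, 0.99, 0.999):
    x = required_input(target)
    print(f"target={target:<6} input={x:.3f} gradient={tanh_grad(x):.4f}")
```

The required input grows only logarithmically (roughly 1.47 for a target of 0.9 and 2.65 for 0.99), so it is not huge, but the gradient at those points shrinks to about 0.19 and 0.02. An output of exactly 1 or -1 is never reached for a finite input.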

What are recommended methods to ensure that the regression output is always in a specific range, in my case [-1,1]?

Additional information: Before the regression layer in the end I have a lot of convolutions and deconvolutions, but I think that this should not be the problem here.

The learning rates don't mean anything to me, and 3e-4 or so isn't uncommon either. But if you can deal with coordinates outside [-1,1], one option would be to use 1.5*tanh - you'd bound yourself to a reasonable range while not saturating.
Jeremy Howard's fast.ai does something like that for predicting movielens ratings in his collaborative filtering lecture (or at least has done that in some version of the course).
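A rough sketch of why the 1.5*tanh idea avoids saturation, again in plain Python. The function names and the `SCALE` constant are assumptions for illustration, not code from fast.ai; the sketch only checks where the scaled function reaches the edge of the target range and what the gradient is there.

```python
import math

SCALE = 1.5  # factor from the suggestion above; any value > 1 leaves headroom

def scaled_tanh(x, scale=SCALE):
    """Output bounded to (-scale, scale) instead of (-1, 1)."""
    return scale * math.tanh(x)

def scaled_tanh_grad(x, scale=SCALE):
    """Derivative of scale * tanh(x) with respect to x."""
    return scale * (1.0 - math.tanh(x) ** 2)

# Pre-activation at which the scaled output hits the edge of [-1, 1].
x_edge = math.atanh(1.0 / SCALE)
print(f"input for output 1.0: {x_edge:.3f}")
print(f"gradient at that input: {scaled_tanh_grad(x_edge):.3f}")
```

With a scale of 1.5, an output of 1.0 corresponds to tanh(x) = 2/3, where the gradient is still about 0.83. The trade-off is the one mentioned above: the output can now fall anywhere in (-1.5, 1.5), so values outside [-1,1] are possible.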

No, I think just because the range is within (-1, 1) does not make Tanh a good choice. The reason is that the distribution of your output will be shifted towards the two ends, so most of your outputs will be close to -1 and +1, while very few values will be close to 0. So, assuming that your target values are uniformly distributed within [-1, 1], Tanh will not be able to generate outputs that match the target distribution.

Thanks for your answer. Unfortunately, limiting to [-1,1] is required. Values outside of this range would mean that my network detects points outside of the image which makes no sense.

What do you mean by "The learning rates don't mean anything to me"? That I should reduce the learning rate?

All examples that I can find on the internet just use a linear layer for the regression. But when I do this, I have absolutely no guarantee about the range of output values.

Thanks! Another idea that I had was to add a custom loss that is 0 when the values are in [-1,1] and that grows very fast when the values are outside that range. This would still not guarantee that my model will predict correct values in every case, but it will make it less likely.
Then I could just add a postprocessing step like this: predicted_value = max(-1, min(1, predicted_value))
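A minimal sketch of that idea for a single scalar prediction, in plain Python. The quadratic penalty shape, the `weight` value and the function names are all assumptions chosen for illustration; "grows very fast" could equally be some other function.

```python
def range_penalty(v, lo=-1.0, hi=1.0):
    """0 inside [lo, hi]; squared distance to the range outside it."""
    excess = max(lo - v, 0.0, v - hi)
    return excess ** 2

def total_loss(pred, target, weight=10.0):
    """L1 regression loss plus the weighted out-of-range penalty.

    `weight` is a hypothetical hyperparameter that would need tuning.
    """
    return abs(pred - target) + weight * range_penalty(pred)

def clamp(v, lo=-1.0, hi=1.0):
    """Postprocessing step: force the prediction into [lo, hi]."""
    return max(lo, min(hi, v))

print(total_loss(0.5, 0.4))   # in range: only the L1 term contributes
print(total_loss(1.5, 1.0))   # out of range: penalty is added
print(clamp(1.5))             # clipped to the upper bound
```

The penalty has zero gradient inside the range, so it only nudges the model when a prediction leaves [-1,1]; the clamp is what actually guarantees the range at inference time. For tensors in PyTorch, `torch.clamp` does the same clipping elementwise.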