Sigmoid Activation Function for regression task


I have built a neural network that predicts 5 continuous values, each in the range between 0 and 1, from video samples. For the last activation I used the Sigmoid activation function, and as a criterion the MSE loss. Are both of these good choices?

Thanks in advance for the help.

Hi Gianluca!

MSELoss is usually the right choice for regression. I would recommend
that you always start with MSELoss and only use something different if
you have good reason and can show that it works better.
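To make the recommendation concrete, here is a minimal sketch of using MSELoss as the criterion (the tensor values are illustrative):

```python
import torch

# MSELoss averages the squared differences between predictions and
# continuous-valued targets: mean((preds - targets)^2).
criterion = torch.nn.MSELoss()

preds = torch.tensor([0.2, 0.9, 0.5])
targets = torch.tensor([0.0, 1.0, 0.5])

loss = criterion(preds, targets)
# mean of (0.2^2, 0.1^2, 0.0^2) = 0.05 / 3
```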

As for the Sigmoid, I would not use it, even though your target values
are in the range [0.0, 1.0]. It is true that Sigmoid maps the real line
(that is, (-inf, inf)) to (0.0, 1.0), so it might seem a natural fit;
however, this is probably an illusion.

If your target (ground truth) values can be close to (or equal to) 0.0 and
1.0, then the output of your network (before passing it through Sigmoid)
would have to be a very large negative number (for a target close to 0.0)
or a very large positive number (for a target close to 1.0), which would
be hard for your network to learn.
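You can see this saturation numerically. A small sketch: `torch.logit` is the inverse of Sigmoid, so it tells you what pre-Sigmoid output the network would have to produce to hit a given target:

```python
import torch

# logit(p) = log(p / (1 - p)) is the pre-Sigmoid value that Sigmoid maps
# to p. It diverges as p approaches 0.0 or 1.0, so targets near the
# endpoints demand very large-magnitude network outputs.
targets = torch.tensor([0.5, 0.99, 0.9999])
required_logits = torch.logit(targets)
print(required_logits)  # roughly [0.00, 4.60, 9.21]
```

So moving a target from 0.99 to 0.9999 roughly doubles the required pre-Sigmoid output, which is the part that is hard for the network to learn.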

You can experiment with Sigmoid if you want, but you should only
actually use it if you can show that it works better than leaving it out.

(I could think of use cases where you would want the Sigmoid, but
they would be contrived, or at least very atypical.)


K. Frank

Thank you very much for the answer and for the clarification. I have just one more question. When you say "…it works better than leaving it out", do you mean considering just the logits coming out of the last FC layer? Or is it better to substitute the sigmoid with another activation function such as ReLU?

Hi Gianluca!

Yes, I was speaking only about whether or not you should have Sigmoid
after your final Linear layer. I recommend using the output of your final
Linear layer as your predictions and feeding them directly to MSELoss.

I was not talking about the non-linear activations between various layers.
Having said that, some of the lore suggests that ReLU is to be preferred
over Sigmoid (but Sigmoid is a perfectly reasonable non-linear activation).
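Putting the two pieces of advice together, a minimal sketch might look like this (the layer sizes and the feature dimension are illustrative assumptions, not taken from your model):

```python
import torch
import torch.nn as nn

# A regression head that predicts 5 continuous values. ReLU is used
# between layers, but there is NO activation after the final Linear
# layer -- its raw output is fed directly to MSELoss.
model = nn.Sequential(
    nn.Linear(128, 64),  # 128 is an assumed input-feature size
    nn.ReLU(),
    nn.Linear(64, 5),    # 5 outputs, no final Sigmoid
)
criterion = nn.MSELoss()

features = torch.randn(8, 128)  # a batch of 8 illustrative feature vectors
targets = torch.rand(8, 5)      # targets in [0.0, 1.0]

loss = criterion(model(features), targets)
loss.backward()  # gradients flow without Sigmoid saturation
```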


K. Frank

Thank you, Frank, your suggestions were of huge help to me.
I'll remove the sigmoid from the final layer then.

Kind regards.