Choice of activation function

I am trying to optimize a model whose output range is [-1, 1], so I know I cannot use ReLU for the last layer, as it never gives a negative output. But I was wondering: can using ReLU in intermediate layers affect the model output, as it will make each activation output greater than 1? If so, choosing another activation function such as Tanh would be more suitable here, or maybe Leaky ReLU. I guess that in the case of Leaky ReLU, the slope is a crucial hyperparameter. I tried all three activation functions: the Tanh model optimizes over time, the ReLU model gives nan output after a few epochs, and the Leaky ReLU model sometimes works and sometimes gives nan output. Does anyone have thoughts on this topic?

Hi Rajat!

It is perfectly fine to use ReLU as the activation for intermediate layers.
Even though the output will be non-negative (I assume that “make
each activation output greater than 1” is a typo.), a subsequent Linear
layer can have negative weights that flip the sign of the positive activation
output and negative biases that shift the positive activation output down
below zero.
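
To make the sign-flipping point concrete, here is a minimal sketch in plain
Python (no framework needed). The weights, bias, and input values are made up
for illustration; `relu` and `linear` are hypothetical helpers, not PyTorch
functions.

```python
import math

def relu(x):
    # ReLU clamps negative values to zero, so its output is never negative.
    return max(0.0, x)

def linear(inputs, weights, bias):
    # A single-output linear layer: weighted sum of the inputs plus a bias.
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hidden activations after ReLU are all >= 0.
hidden = [relu(v) for v in (-2.0, 0.5, 3.0)]

# Illustrative (made-up) parameters: negative weights and a negative bias
# push the weighted sum of non-negative activations below zero.
out = linear(hidden, weights=[1.0, -1.0, -0.5], bias=-0.25)

# A final tanh squashes the result into (-1, 1), keeping the negative sign.
y = math.tanh(out)
print(hidden, out, y)
```

So a network with ReLU in every hidden layer can still produce negative
values at its output.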

As a practical modeller, you might then want to use Tanh – because it

As for the nan, ReLU might still work, but you might start with a lower
learning rate as doing so can sometimes avoid the nans. One approach
is to start with a small learning rate so that your training can avoid nans
while the model’s training “settles down,” increase the learning rate for
a while so you make faster progress, and then lower the learning rate
again so you can “fine tune” the model’s parameters.
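
The learning-rate approach described above could be sketched roughly as
follows. This is only an illustration: the function name `lr_factor` and all
of the step counts and rates are assumptions, not recommended values. It
returns a multiplier for your base learning rate, which is the form that
PyTorch's `torch.optim.lr_scheduler.LambdaLR` expects for its `lr_lambda`
argument.

```python
def lr_factor(step, warmup_steps=100, decay_start=1000,
              start_factor=0.1, decay_rate=0.999):
    # Phase 1: ramp linearly from a small fraction of the base learning
    # rate up to the full rate while training "settles down".
    if step < warmup_steps:
        return start_factor + (1.0 - start_factor) * step / warmup_steps
    # Phase 2: train at the full base learning rate for faster progress.
    if step < decay_start:
        return 1.0
    # Phase 3: decay exponentially to "fine tune" the parameters.
    return decay_rate ** (step - decay_start)

# Multipliers at a few sample steps (all values here are hypothetical).
for step in (0, 50, 100, 500, 2000):
    print(step, lr_factor(step))
```

The defaults here are placeholders; you would need to tune them for your
own model.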


K. Frank

As you said, subsequent layers can flip the sign. But in my model I am using ReLU in all layers except the last one, which is Tanh, so I think the model will have a hard time learning. Also, for Leaky ReLU the slope is a crucial parameter.