I am trying to optimize a model whose output range is **[-1, 1]**, so I know I cannot use `ReLU` for the last layer, as it always gives a non-negative output. But I was wondering: can using `ReLU` in intermediate layers affect the model output, as it will make each activation output greater than 1? Thus choosing another activation function, such as `Tanh`, would be more suitable here, or maybe `Leaky ReLU`. I guess that in the case of `Leaky ReLU`, setting the slope is a crucial hyperparameter. I tried all three activation functions: `Tanh` optimizes over time, the `ReLU` model gives `nan` output after a few epochs, and `Leaky ReLU` sometimes works and sometimes gives `nan` output. Could someone share their thoughts on this topic?

Hi Rajat!

It is perfectly fine to use `ReLU` as the activation for intermediate layers.

Even though the output will be non-negative (I assume that "make each activation output greater than 1" is a typo), a subsequent `Linear` layer can have negative weights that flip the sign of the positive activation output and negative biases that shift the positive activation output down below zero.
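
For concreteness, here is a minimal sketch (the layer sizes are arbitrary) of a network that uses `ReLU` in its hidden layers and `Tanh` on the output:

```python
import torch
import torch.nn as nn

# ReLU hidden layers; Tanh on the last layer maps the output into [-1, 1].
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),          # hidden activations are >= 0 ...
    nn.Linear(32, 32),  # ... but this layer's weights and biases can be
    nn.ReLU(),          #     negative, so its pre-activations can dip below 0
    nn.Linear(32, 1),
    nn.Tanh(),          # final output lies in (-1, 1)
)

x = torch.randn(4, 16)
print(model(x))  # all values in (-1, 1)
```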

As a practical modeller, you might then want to use `Tanh`, because it works!

As for the `nan`s, `ReLU` might still work, but you might start with a lower learning rate, as doing so can sometimes avoid the `nan`s. One approach is to start with a small learning rate so that your training can avoid `nan`s while the model's training "settles down," increase the learning rate for a while so you make faster progress, and then lower the learning rate again so you can "fine tune" the model's parameters.
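
Here is a hedged sketch of that low-high-low schedule using PyTorch's built-in `OneCycleLR` (the model, loss, and step count below are placeholders):

```python
import torch

model = torch.nn.Linear(16, 1)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# lr ramps from max_lr / div_factor up to max_lr over the first 30% of
# steps, then anneals back down for the remainder.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=1000,
    pct_start=0.3, div_factor=25.0,
)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the lr schedule once per batch
```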

Best.

K. Frank

Like you said, in subsequent layers the sign can flip. But in the case of my model I am using `ReLU` at all layers except the last one, which is `Tanh`. So I think the model will have a hard time learning. Also, for `Leaky ReLU` the slope is a crucial parameter.
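
For reference, in PyTorch the Leaky ReLU slope is the `negative_slope` argument of `nn.LeakyReLU` (the default is 0.01), and it can be swept like any other hyperparameter; a minimal illustration:

```python
import torch
import torch.nn as nn

# negative_slope scales negative inputs instead of zeroing them out.
act = nn.LeakyReLU(negative_slope=0.01)
print(act(torch.tensor([-2.0, 3.0])))  # tensor([-0.0200, 3.0000])
```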