I am trying to optimize a model whose output range is [-1, 1], so I know I cannot use ReLU for the last layer, as it only gives non-negative output. But I was wondering whether using ReLU in the intermediate layers can also affect the model output, since it will make each activation output greater than 1. If so, choosing another activation function such as Tanh, or maybe Leaky ReLU, would be more suitable here. I guess in the case of Leaky ReLU, setting the slope is a crucial hyperparameter. I tried all three activation functions: Tanh optimizes over time, the ReLU model gives nan output after a few epochs, and Leaky ReLU sometimes works and sometimes gives nan output. Could someone share their thoughts on this topic?
Hi Rajat!
It is perfectly fine to use ReLU as the activation for intermediate layers.

Even though the output will be greater than zero (I assume that "make each activation output greater than 1" is a typo), a subsequent Linear layer can have negative weights that flip the sign of the positive activation output and negative biases that shift the positive activation output down below zero.
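To make that concrete, here is a minimal dependency-free sketch (plain Python, with illustrative weight and bias values I chose for this example) showing how a non-negative ReLU activation can still end up as a negative final output after passing through a layer with negative weight and bias:

```python
import math

def relu(x):
    """ReLU: output is always >= 0."""
    return max(0.0, x)

def linear(x, w, b):
    """A one-unit Linear layer: w * x + b."""
    return w * x + b

# Intermediate activation: strictly non-negative.
h = relu(2.0)                    # h = 2.0

# A subsequent layer with a negative weight flips the sign,
# and a negative bias shifts the result further below zero.
y = linear(h, w=-0.5, b=-0.3)    # y is negative (about -1.3)

# A final Tanh squashes it into (-1, 1), still negative.
out = math.tanh(y)
```

So ReLU in the hidden layers does not prevent the network from reaching negative outputs; the weights and biases of the later layers handle the sign.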
As a practical modeller, you might then want to use Tanh – because it works!
As for the nans, ReLU might still work, but you might start with a lower learning rate, as doing so can sometimes avoid the nans. One approach is to start with a small learning rate so that your training can avoid nans while the model's training "settles down," increase the learning rate for a while so you make faster progress, and then lower the learning rate again so you can "fine tune" the model's parameters.
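That three-phase schedule could be sketched as a simple function of the epoch number – the phase boundaries and learning-rate values below are illustrative placeholders, not recommendations:

```python
def lr_schedule(epoch, warmup_end=10, high_end=40,
                low_lr=1e-4, high_lr=1e-3, fine_lr=1e-5):
    """Illustrative three-phase learning-rate schedule:
    small at first so training avoids nans while it settles down,
    larger for a while for faster progress,
    then small again for fine tuning."""
    if epoch < warmup_end:
        return low_lr      # settle-down phase
    elif epoch < high_end:
        return high_lr     # faster-progress phase
    else:
        return fine_lr     # fine-tuning phase
```

You would then set the optimizer's learning rate to `lr_schedule(epoch)` at the start of each epoch.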
Best.
K. Frank
Like you said, in subsequent layers the sign will flip. But in the case of my model, I am using ReLU at all layers except the last one, which is Tanh. So I think the model will have a hard time learning. Also, for Leaky ReLU the slope is a crucial parameter.