I tried to implement my own version of SELU like this:
```python
import torch
import torch.nn as nn

class MySelu(nn.Module):
    def __init__(self):
        super(MySelu, self).__init__()

    def forward(self, x):
        term1 = torch.clamp(x, min=0)
        term2 = torch.clamp(1.6732632423543772848170429916717 * (torch.exp(x) - 1), max=0)
        return 1.0507009873554804934193349852946 * (term1 + term2)
```
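If I reduce it, a single large input already reproduces the NaN. In float32, `exp(100)` overflows to `inf`; the outer `clamp` masks it out in the forward pass, but in the backward pass the zero gradient from the clamp gets multiplied by that `inf`, which gives NaN:

```python
import torch
import torch.nn as nn

class MySelu(nn.Module):
    def forward(self, x):
        term1 = torch.clamp(x, min=0)
        term2 = torch.clamp(1.6732632423543772848170429916717 * (torch.exp(x) - 1), max=0)
        return 1.0507009873554804934193349852946 * (term1 + term2)

# exp(100) overflows to inf in float32. Forward is fine because
# clamp(inf, max=0) == 0, but the backward of exp multiplies the
# zero clamp-gradient by exp(x) == inf, and 0 * inf == nan.
x = torch.tensor([100.0], requires_grad=True)
MySelu()(x).sum().backward()
print(x.grad)  # tensor([nan])
```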
It turns out, though, that I quickly run into NaNs in the backward pass when using this module. Using autograd anomaly detection, I found that this is due to exp(x) producing NaN in the backward pass. When I look at the CUDA implementation of SELU in PyTorch (https://github.com/pytorch/pytorch/blob/3796cbaf7ab4c4c30cb99191d55a8b9c50b398dc/caffe2/operators/selu_op.cu), it says "// Reuse Y[i] to avoid computing exp(X[i])". What do they mean? Which trick do they use to stabilize the computation?
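For context, my current workaround is to clamp *before* the exp, so exp only ever sees non-positive inputs and can never overflow; this avoids the NaN in my tests, though I'm not sure it is the same trick the CUDA kernel uses:

```python
import torch
import torch.nn as nn

class StableSelu(nn.Module):
    # Workaround sketch: feed exp() only the clamped-negative part of x.
    # exp(x) with x <= 0 lies in (0, 1], so it cannot overflow, and its
    # backward gradient stays finite for all inputs.
    def forward(self, x):
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        term1 = torch.clamp(x, min=0)
        # For x > 0, exp(clamp(x, max=0)) - 1 == exp(0) - 1 == 0,
        # so the formula still matches SELU on the positive side.
        term2 = alpha * (torch.exp(torch.clamp(x, max=0)) - 1)
        return scale * (term1 + term2)
```

With this version, both the outputs and the gradients stay finite even for large inputs like `x = 100`.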