I tried to implement my own version of SELU like this:
```python
import torch
import torch.nn as nn

class MySelu(nn.Module):
    def __init__(self):
        super(MySelu, self).__init__()

    def forward(self, x):
        term1 = torch.clamp(x, min=0)
        term2 = torch.clamp(1.6732632423543772848170429916717 * (torch.exp(x) - 1), max=0)
        return 1.0507009873554804934193349852946 * (term1 + term2)
```
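If I reduce it, a single large input already reproduces the NaN. In float32, `exp(100)` overflows to `inf`; the outer `clamp` masks it out in the forward pass, but in the backward pass the zero gradient from the clamp gets multiplied by that `inf`, which gives NaN:

```python
import torch
import torch.nn as nn

class MySelu(nn.Module):
    def forward(self, x):
        term1 = torch.clamp(x, min=0)
        term2 = torch.clamp(1.6732632423543772848170429916717 * (torch.exp(x) - 1), max=0)
        return 1.0507009873554804934193349852946 * (term1 + term2)

# exp(100) overflows to inf in float32. Forward is fine because
# clamp(inf, max=0) == 0, but the backward of exp multiplies the
# zero clamp-gradient by exp(x) == inf, and 0 * inf == nan.
x = torch.tensor([100.0], requires_grad=True)
MySelu()(x).sum().backward()
print(x.grad)  # tensor([nan])
```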
It turns out, though, that I quickly run into NaNs in the backward pass when using this module. Using autograd anomaly detection, I found that this is due to exp(x) producing NaN in the backward pass. When I look at the CUDA implementation of SELU in PyTorch (https://github.com/pytorch/pytorch/blob/3796cbaf7ab4c4c30cb99191d55a8b9c50b398dc/caffe2/operators/selu_op.cu), it says "// Reuse Y[i] to avoid computing exp(X[i])". What do they mean? Which trick do they use to stabilize the computation?
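For context, my current workaround is to clamp *before* the exp, so exp only ever sees non-positive inputs and can never overflow; this avoids the NaN in my tests, though I'm not sure it is the same trick the CUDA kernel uses:

```python
import torch
import torch.nn as nn

class StableSelu(nn.Module):
    # Workaround sketch: feed exp() only the clamped-negative part of x.
    # exp(x) with x <= 0 lies in (0, 1], so it cannot overflow, and its
    # backward gradient stays finite for all inputs.
    def forward(self, x):
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        term1 = torch.clamp(x, min=0)
        # For x > 0, exp(clamp(x, max=0)) - 1 == exp(0) - 1 == 0,
        # so the formula still matches SELU on the positive side.
        term2 = alpha * (torch.exp(torch.clamp(x, max=0)) - 1)
        return scale * (term1 + term2)
```

With this version, both the outputs and the gradients stay finite even for large inputs like `x = 100`.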