Hi, for a particular reason I need to hand-derive the derivative of a GELU, and it has to match the GELU implementation in PyTorch. I tried finding the exact GELU formula and I'm a little confused. Could someone help me by confirming whether the following is exactly equivalent to it:
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
If not, could you provide the correct explicit formula? (Not for the derivative, just for the raw GELU)
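For what it's worth, that tanh-based formula (with the parenthesis count fixed) can be checked directly against PyTorch's own tanh-approximation mode. A minimal sketch, assuming a PyTorch version (>= 1.12) that supports the `approximate="tanh"` argument of `torch.nn.functional.gelu`:

```python
import math
import torch

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                       * (x + 0.044715 * torch.pow(x, 3))))

x = torch.linspace(-5.0, 5.0, steps=101)
ref = torch.nn.functional.gelu(x, approximate="tanh")
print(torch.allclose(gelu_tanh(x), ref, atol=1e-6))
```

Note that this only matches PyTorch's *approximate* GELU; the default mode uses erf, as discussed below in the thread.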
Based on this test, gelu should correspond to:
const auto y_exp = x * 0.5 * (1.0 + torch::erf(x / std::sqrt(2.0)));
torch::erf is presumably dispatched to the C standard library's erf or to the CUDA implementation.
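For reference, a minimal Python translation of that C++ line, assuming it corresponds to the default `torch.nn.functional.gelu` path, would be:

```python
import math
import torch

def gelu_exact(x):
    # x * 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.randn(1000)
print(torch.allclose(gelu_exact(x), torch.nn.functional.gelu(x), atol=1e-6))
```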
I saw that elsewhere, but it doesn’t help me compute a derivative by hand because I don’t know the exact function that torch::erf takes.
The function can be found in the documentation for torch.erf().
More detail can be found in the wikipedia entry Error function.
As you can see, its derivative is just a scaled Gaussian, erf'(t) = 2/sqrt(pi) * exp(-t^2), which is the probability density function of a normal distribution up to rescaling.
Hmm, but how does PyTorch implement that integral? Is there no simple closed-form equivalent to PyTorch's implementation that I could enter into the following website:
I need to derive an MLP that uses GeLU activations and it would be convenient if I could just enter the full formula into that tool above.
erf() is a so-called special function. It is well studied, well understood, and (reasonably) easy to calculate with modern numerical analysis techniques.

But there is no "simple function equivalent" to it (other than things like the integral used to define it). So writing the result in terms of erf() is the best you can do.

If you want to "compute a derivative by hand," that's easy, because the derivative of erf() is an elementary function, namely the scaled Gaussian 2/sqrt(pi) * exp(-x^2).
My pocket calculator doesn't know anything about erf(), so there's no practical way for me to calculate erf() with my calculator.
But python does know about erf(): math.erf(1.234) works just fine in python.
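To illustrate both points, here is the standard-library call together with a finite-difference check that the derivative really is the scaled Gaussian (the step size and tolerance are assumptions for a central difference in double precision):

```python
import math

print(math.erf(1.234))  # roughly 0.919

x, h = 1.234, 1e-6
numeric = (math.erf(x + h) - math.erf(x - h)) / (2.0 * h)
analytic = 2.0 / math.sqrt(math.pi) * math.exp(-x * x)
print(abs(numeric - analytic) < 1e-8)
```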
If the calculus tool you linked to knows about erf(), then you should be good. If it doesn't, you'll have to "compute a derivative by hand" (or switch to a better calculus tool).
I'm confused: if there is no simple function equivalent, then how does PyTorch compute it? From reading about GELU, it seems like the paper used a tanh-based approximation of the erf function.
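The answer is that the underlying math libraries compute erf() numerically, typically with polynomial or rational approximations, which is how most special functions are evaluated. As an illustration only (this is not PyTorch's actual code), here is the classic Abramowitz & Stegun formula 7.1.26, accurate to about 1.5e-7:

```python
import math

def erf_approx(x):
    # Abramowitz & Stegun 7.1.26: erf(x) ~ 1 - poly(t) * exp(-x^2),
    # with t = 1/(1 + p*x), valid for x >= 0; erf is odd, so flip the sign.
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592
           + t * (-0.284496736
           + t * (1.421413741
           + t * (-1.453152027
           + t * 1.061405429))))
    return sign * (1.0 - poly * math.exp(-x * x))

# Worst-case error against the standard library over [-3, 3]:
err = max(abs(erf_approx(i / 100.0) - math.erf(i / 100.0))
          for i in range(-300, 301))
print(err)
```

The tanh formula in the GELU paper is a further approximation layered on top of this idea; PyTorch's default GELU uses the "real" erf() from the platform's math library, not the tanh form.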