Assume an equation involving an nn.Parameter contains a non-differentiable operation, so the gradient needs to be estimated with a straight-through estimator (STE) before the parameter can be updated. For example,

y = 0.5 * (|x| - |x - alpha| + alpha)
y_q = round(y * (2 ** k - 1) / alpha) * alpha / (2 ** k - 1)

In these equations, alpha is a trainable parameter, and the derivative of y_q w.r.t. alpha needs to be estimated using an STE (round() has zero gradient almost everywhere).

How can I define the approximated gradient and use it in an optimizer to update the parameter along with the rest of the parameters in the network?

Note that it should likely be **, not ^.
I like the trick of y_q_diffable = y + (y_q - y).detach(). (In fact, I once proposed to give a lightning talk just on this line of code and where it is useful.)
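A minimal sketch of how the trick applies to the equations in the question (the module name, `k=4`, and the training snippet are just for illustration): because the forward value is `y_q` but the backward pass sees only `y`, alpha receives the gradient of the unquantized clipping function and can be updated by any regular optimizer.

```python
import torch
import torch.nn as nn

class ClipQuant(nn.Module):
    """PACT-style clip-then-quantize with a trainable clipping level alpha."""

    def __init__(self, alpha_init=1.0, k=4):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.k = k

    def forward(self, x):
        # y = 0.5 * (|x| - |x - alpha| + alpha) clips x to [0, alpha]
        y = 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
        levels = 2 ** self.k - 1
        # round() is non-differentiable (zero gradient a.e.)
        y_q = torch.round(y * levels / self.alpha) * self.alpha / levels
        # STE: forward value equals y_q, gradient flows through y
        return y + (y_q - y).detach()

module = ClipQuant(alpha_init=1.0, k=4)
opt = torch.optim.SGD(module.parameters(), lr=0.1)

x = torch.randn(8)
loss = module(x).sum()
loss.backward()          # alpha now has a gradient despite round()
opt.step()               # alpha is updated like any other parameter
```

Since the whole thing stays inside autograd, no custom `torch.autograd.Function` is needed, and `module.parameters()` hands alpha to the optimizer along with everything else.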

I always like to credit @hughperkins for sharing the trick here on the forums when he saw it in a paper; he knows a ton of references for applications, too.

No, I didn't, but my favourite application, which I use in my autograd course, is to emulate quantization-aware training with it. The course is not freely available, but the particular example is also included in the ACDL "Advanced introduction to PyTorch" talk, for which I published the slides. I don't know of any video recording, and there wasn't enough interest to re-record it back then.