Is this operation running on the CPU or the GPU? On the GPU, one potential optimization is to write a custom kernel/extension (see the "Custom C++ and CUDA Extensions" tutorial in the PyTorch Tutorials 1.9.0+cu102 documentation) that fuses the operations together, because as written the activation function incurs many global memory reads and writes. From a quick glance, x is read, negated, and written back; read again to add a bias and written back; read again to negate and written back; then read once more to add a bias and finally written back to produce the output. A single fused kernel could reduce this to one read and one write. However, in addition to writing the new kernel/CUDA extension, you would also need to implement your own backward kernel for training, since the new operation would not be registered with autograd.
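To make the autograd part concrete, here is a rough sketch of how a fused op and its hand-written backward could be wrapped in `torch.autograd.Function`. Everything specific in it is an assumption on my side: the names `unfused_reference` and `FusedActivation` are made up, the chain (negate, add bias, ReLU, negate, add bias) is only a stand-in for your activation, and the biases are assumed to have shape `(C,)` broadcast over an input of shape `(N, C)`. The forward body is still plain PyTorch here; in a real extension it is the part you would replace with a call into your compiled kernel.

```python
import torch


def unfused_reference(x, b1, b2):
    # Hypothetical stand-in for the activation. In eager mode each line is
    # a separate elementwise op with its own read and write of the tensor.
    t = -x
    t = t + b1
    t = torch.relu(t)  # assumed nonlinearity, not taken from the thread
    t = -t
    t = t + b2
    return t


class FusedActivation(torch.autograd.Function):
    """Sketch of a fused op: y = b2 - relu(b1 - x), with a manual backward."""

    @staticmethod
    def forward(ctx, x, b1, b2):
        # Placeholder for a single fused kernel launch (one read, one write).
        ctx.save_for_backward(x, b1)
        return b2 - torch.relu(b1 - x)

    @staticmethod
    def backward(ctx, grad_out):
        x, b1 = ctx.saved_tensors
        # Recompute where the ReLU was active instead of storing it.
        mask = ((b1 - x) > 0).to(grad_out.dtype)
        grad_x = grad_out * mask      # dy/dx  = +1 where the ReLU is active
        grad_b1 = -grad_x.sum(dim=0)  # dy/db1 = -1, reduced over the batch
        grad_b2 = grad_out.sum(dim=0)  # dy/db2 = 1, reduced over the batch
        return grad_x, grad_b1, grad_b2


if __name__ == "__main__":
    x = torch.randn(4, 3, requires_grad=True)
    b1 = torch.randn(3, requires_grad=True)
    b2 = torch.randn(3, requires_grad=True)
    y = FusedActivation.apply(x, b1, b2)
    y.sum().backward()
    print(y.shape, x.grad.shape, b1.grad.shape)
```

You would call `FusedActivation.apply(x, b1, b2)` wherever the unfused chain is used today. Checking the outputs and gradients against the unfused version (or with `torch.autograd.gradcheck` on double-precision inputs) before swapping in the CUDA kernel is a cheap way to catch a wrong backward early.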
If this works for your use case, please update this thread with your performance results! It could be a great example for others who are also trying to optimize similar operators/kernels.