Custom activation function is slow


I’m new here and I’m writing a custom activation function with learnable parameters:

How it looks:

The formula:
x[x < 0] = x[x < 0] * alpha_low
x[x > bias] = x[x > bias] * alpha_high + bias
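For reference, the piecewise formula as written above can be expressed directly with torch.where (a minimal sketch, not the author's PReLU-based version below; `custom_relu` is just an illustrative name):

```python
import torch

def custom_relu(x, alpha_low=0.01, alpha_high=0.01, bias=1.0):
    # Piecewise, per the formula above:
    #   x < 0          -> alpha_low * x
    #   0 <= x <= bias -> x
    #   x > bias       -> alpha_high * x + bias
    out = torch.where(x < 0, alpha_low * x, x)
    out = torch.where(x > bias, alpha_high * x + bias, out)
    return out
```

Note that the conditions compare against the original x both times, so the two selections cannot interfere with each other.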

I tried several ways of implementing it, and here is the fastest result:

class CustomRelu(nn.Module):
    def __init__(self, alpha_low=0.01, alpha_high=0.01, bias=1.):
        super(CustomRelu, self).__init__()
        self.prelu_low = nn.PReLU(init=alpha_low)
        self.prelu_high = nn.PReLU(init=alpha_high)
        self.bias = bias

    def forward(self, x):
        # Two negated PReLUs: the first applies the x < 0 slope,
        # the second (after the flip) applies the x > bias slope.
        x = -self.prelu_low(x) + self.bias
        x = -self.prelu_high(x) + self.bias
        return x

But the problem is that it’s still much slower than the built-in activation functions (about 5–10 times slower).
Is there a way to optimize it?

Is this operation done on the CPU or the GPU? One potential optimization on the GPU is writing a custom kernel/extension (Custom C++ and CUDA Extensions — PyTorch Tutorials 1.9.0+cu102 documentation) to fuse the operations together, because as written, the activation function incurs many global memory reads/writes. From a quick glance: x is read, negated, and written back; read again to add the bias and written back; read again to negate and written back; then read again to add the bias and finally written back to produce the output. With a single fused kernel, this could be reduced to one read and one write per element. However, in addition to requiring a new CUDA kernel/extension, you would also need to implement your own backward kernel for training, as the new operation would not be registered with autograd.
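To illustrate the backward requirement, here is a minimal pure-Python sketch using torch.autograd.Function — not the fused CUDA kernel itself, but it shows the hand-written backward a fused op would also need. `CustomReluFn` is a made-up name, and the alphas are treated as fixed floats rather than learnable parameters:

```python
import torch

class CustomReluFn(torch.autograd.Function):
    # Illustrative only: a fused CUDA op would pair its forward kernel
    # with a backward kernel computing the same per-element slopes.
    @staticmethod
    def forward(ctx, x, alpha_low=0.01, alpha_high=0.01, bias=1.0):
        ctx.save_for_backward(x)
        ctx.alpha_low, ctx.alpha_high, ctx.bias = alpha_low, alpha_high, bias
        out = torch.where(x < 0, alpha_low * x, x)
        out = torch.where(x > bias, alpha_high * x + bias, out)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Local slope of the piecewise function at each element.
        slope = torch.ones_like(x)
        slope = torch.where(x < 0, torch.full_like(x, ctx.alpha_low), slope)
        slope = torch.where(x > ctx.bias, torch.full_like(x, ctx.alpha_high), slope)
        # One gradient per forward input; None for the non-tensor args.
        return grad_out * slope, None, None, None
```

Calling `CustomReluFn.apply(x)` then runs this backward during training instead of autograd's recorded graph, which is the piece a custom extension must supply itself.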

Thank you. I’ll try it!

If this works for your use case, please update this thread with your performance results! It could be a great example for others trying to optimize similar operators/kernels.