Custom activation function is slow

Jow1e · August 14, 2021, 11:33am

Hi!

I’m new here and I am writing a custom activation function with learnable parameters:

How it looks:

The formula:
x[ x < 0 ] = x[ x < 0 ] * alpha_low
x[ x > bias ] = x[ x > bias ] * alpha_high + bias

I tried several ways of implementing it, and here is the fastest one:

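import torch.nn as nn

class CustomRelu(nn.Module):
	def __init__(self, alpha_low=0.01, alpha_high=0.01, bias=1.):
		super().__init__()
		# Learnable slopes, stored as the weights of two PReLU modules.
		self.prelu_low = nn.PReLU(init=alpha_low)
		self.prelu_high = nn.PReLU(init=alpha_high)
		# Fixed (non-learnable) threshold.
		self.bias = bias

	def forward(self, x):
		# Each step applies a PReLU, then negates the result and adds bias.
		x = -self.prelu_low(x) + self.bias
		x = -self.prelu_high(x) + self.bias
		return x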
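
To see what the two stacked PReLU calls compute, here is a minimal plain-Python sketch (scalars only, no PyTorch) that mirrors the forward pass above and compares it with a direct piecewise form. The helper names `prelu`, `stacked_prelu` and `piecewise` are made up for illustration, and the sketch assumes alpha_low > 0 and bias > 0, as in the defaults. Note that above bias the composition gives alpha_high * (x - bias) + bias, so the slope applies only to the part of x above bias.

```python
def prelu(x, alpha):
    # Scalar PReLU: identity for x >= 0, slope alpha for x < 0.
    return x if x >= 0 else alpha * x


def stacked_prelu(x, alpha_low=0.01, alpha_high=0.01, bias=1.0):
    # Mirrors CustomRelu.forward: PReLU, negate, add bias, done twice.
    x = -prelu(x, alpha_low) + bias
    x = -prelu(x, alpha_high) + bias
    return x


def piecewise(x, alpha_low=0.01, alpha_high=0.01, bias=1.0):
    # Direct form, derived by hand from the composition above
    # (assumes alpha_low > 0 and bias > 0).
    if x < 0:
        return alpha_low * x
    if x > bias:
        return alpha_high * (x - bias) + bias
    return x


# The two forms should agree on every branch, up to float rounding.
for x in (-3.0, -0.5, 0.0, 0.25, 1.0, 1.5, 4.0):
    assert abs(stacked_prelu(x) - piecewise(x)) < 1e-12
```

The direct form touches each value once, which is roughly what a fused implementation would aim for.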

But the problem is, it's still much slower than the built-in activation functions (about 5-10 times slower).
Is there a way to optimize it?

eqy · August 14, 2021, 7:02pm

Is this operation done on CPU or GPU? One potential optimization on GPU is to write a custom kernel/extension (see "Custom C++ and CUDA Extensions" in the PyTorch Tutorials 1.9.0+cu102 documentation) that fuses the operations together, because as written, the activation function incurs many global memory reads and writes. From a quick glance, x is read, negated, written back, read again to add bias, written back, read again to negate, written back, read again to add bias, and finally written back to produce the output. With a single kernel, this could be reduced to a single read and a single write. However, in addition to writing the new kernel/CUDA extension, you would also need to implement your own backward kernel for training, as the new operation would not be registered in autograd.
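
As a rough illustration of what a fused kernel and its hand-written backward would have to compute per element, here is a plain-Python sketch. It is not a CUDA kernel or a PyTorch extension. It assumes the piecewise form that the posted module evaluates (slope alpha_low below 0, identity on [0, bias], slope alpha_high above bias) and a fixed, non-learnable bias. The names `fused_forward`, `fused_backward` and `numeric_grads` are hypothetical, and the gradient formulas are derived by hand here, so they should be checked against autograd before being relied on.

```python
def fused_forward(x, alpha_low, alpha_high, bias):
    # One read of x and one write of the result per element.
    if x < 0:
        return alpha_low * x
    if x > bias:
        return alpha_high * (x - bias) + bias
    return x


def fused_backward(x, grad_out, alpha_low, alpha_high, bias):
    # Per-element gradients (d/dx, d/dalpha_low, d/dalpha_high).
    # In a real kernel the two alpha gradients would be summed over
    # all elements that share the parameter.
    if x < 0:
        return grad_out * alpha_low, grad_out * x, 0.0
    if x > bias:
        return grad_out * alpha_high, 0.0, grad_out * (x - bias)
    return grad_out, 0.0, 0.0


def numeric_grads(x, alpha_low, alpha_high, bias, eps=1e-6):
    # Central differences, used only to sanity-check fused_backward.
    f = fused_forward
    dx = (f(x + eps, alpha_low, alpha_high, bias)
          - f(x - eps, alpha_low, alpha_high, bias)) / (2 * eps)
    dlow = (f(x, alpha_low + eps, alpha_high, bias)
            - f(x, alpha_low - eps, alpha_high, bias)) / (2 * eps)
    dhigh = (f(x, alpha_low, alpha_high + eps, bias)
             - f(x, alpha_low, alpha_high - eps, bias)) / (2 * eps)
    return dx, dlow, dhigh


# Check one point on each branch, away from the kinks at 0 and bias.
for x in (-2.0, 0.5, 3.0):
    analytic = fused_backward(x, 1.0, 0.01, 0.02, 1.0)
    numeric = numeric_grads(x, 0.01, 0.02, 1.0)
    assert all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric))
```

The gradient is undefined exactly at x = 0 and x = bias; this sketch arbitrarily assigns both points to the identity branch.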

Jow1e · August 14, 2021, 9:10pm

Thank you. I’ll try it!

eqy · August 14, 2021, 9:38pm

If this works for your use case, please update this thread with your performance results! It could be a great example for others who are also trying to optimize similar operators/kernels.