Torch autograd slow for self-created activation function

Hello,
I created an activation function to experiment with, and when I implement it as a torch.autograd.Function it is very slow. To investigate, I implemented ReLU myself (similar to the built-in one in PyTorch), defining the forward and backward passes, and compared it against the built-in nn.ReLU(). My version is roughly 10x slower.

Any idea why this is happening?
Is it because I haven't written GPU-optimized code, or something else? Any thoughts are welcome.

Thanks!

@kumar-shridhar can you please post the code for the ReLU function that you created and the script you used to test them?

import torch

class ReLUTest(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        # Save the original input so the backward pass can rebuild the mask.
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient through where the input was positive, zero it elsewhere.
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input