The thing is that the official relu is implemented as a single kernel that does the whole operation at once.
What you do here is 3 different operations: comparison with 0, advanced indexing, and element-wise multiplication.
If you run this on GPU, it will launch 3 kernels instead of 1 and so is expected to be slower.
This is exactly the reason why we have specialized kernels for all the common operations!
All CUDA operations are asynchronous, so you should call torch.cuda.synchronize() before reading your timer to make sure that you measure the actual execution time and not just how long it took to queue the kernels.
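As a minimal sketch of what I mean (the `timed` helper here is just for illustration, not part of any library):

```python
import time
import torch

def timed(fn, x, iters=100):
    # Warm-up run so we don't measure one-time setup cost.
    fn(x.clone())
    if x.is_cuda:
        # Wait for all queued kernels before starting the clock.
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x.clone())
    if x.is_cuda:
        # Without this, you only measure how long it took to queue the kernels.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```

On CPU the synchronize calls are no-ops, so the same helper works for both devices.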
Avoiding the multiplication will help: x[x < 0] = 0 will be faster.
Avoiding the indexing altogether will be even better I think: x *= (x >= 0).type_as(x).
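For reference, both variants give the same result as the fused relu kernel; this sketch checks that on CPU (the speed difference will only show up on GPU):

```python
import torch

x = torch.randn(1000)
ref = torch.relu(x)  # single fused kernel

a = x.clone()
a[a < 0] = 0  # comparison + advanced indexing: 2 kernels on GPU

b = x.clone()
b *= (b >= 0).type_as(b)  # comparison + multiply, no indexing: 2 kernels on GPU

assert torch.equal(a, ref)
assert torch.equal(b, ref)
```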