Merging multiple ops in a PyTorch C++/CUDA extension leads to a slower backward pass

Hey guys, I want to implement this cyclic function: f(x) = |((x - 1) mod 4) - 2| - 1, a triangle wave with period 4.

Specifically, I first implemented it as a combination of five PyTorch ops:

import torch

def origin_op(x):
    # five separate elementwise ops, each launching its own kernel
    x = torch.add(x, -1)
    x = torch.remainder(x, 4)
    x = torch.add(x, -2)
    x = torch.abs(x)
    x = torch.add(x, -1)
    return x

However, it is rather slow, so I re-implemented it in CUDA:

template <typename scalar_t>
__device__ __forceinline__ scalar_t forward1(scalar_t z) {
  // fmod keeps the sign of the dividend, so shift by +2 or -2
  // to get the same value as remainder(z - 1, 4) - 2
  auto res = fmod(z - 1, (scalar_t)4);
  res += res < 0 ? (scalar_t)2 : (scalar_t)-2;
  return res; // will be cached for backward
}

template <typename scalar_t>
__device__ __forceinline__ scalar_t forward2(scalar_t z) {
  const auto res = fabs(z) - 1;
  return res;
}

template <typename scalar_t>
__device__ __forceinline__ scalar_t backward(scalar_t tmp, scalar_t d_output) {
  // gradient of |tmp| - 1: pass d_output through for tmp > 0, negate it otherwise
  const auto res = tmp > 0 ? d_output : -d_output;
  return res;
}

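For anyone who wants to sanity-check the fmod trick in forward1 without a GPU, here is a minimal plain-Python sketch. It relies on Python's `%` taking the sign of the divisor (like torch.remainder) and `math.fmod` taking the sign of the dividend (like C's fmod). The function names just mirror the device functions above; `reference` is an illustrative name, not part of the extension.

```python
import math

def reference(x):
    """Scalar mirror of origin_op: |((x - 1) mod 4) - 2| - 1."""
    return abs((x - 1) % 4 - 2) - 1  # % has the divisor's sign, like torch.remainder

def forward1(z):
    """Scalar mirror of the CUDA forward1."""
    res = math.fmod(z - 1, 4.0)      # fmod has the dividend's sign, like C fmod
    res += 2.0 if res < 0 else -2.0  # now equal to ((z - 1) mod 4) - 2
    return res                       # this is the value cached for backward

def forward2(tmp):
    """Scalar mirror of the CUDA forward2."""
    return abs(tmp) - 1

def backward(tmp, d_output):
    """Scalar mirror of the CUDA backward: sign of tmp decides the sign of the gradient."""
    return d_output if tmp > 0 else -d_output

# Multiples of 0.25 are exactly representable, so the comparison is exact.
xs = [k * 0.25 for k in range(-40, 41)]
max_err = max(abs(forward2(forward1(x)) - reference(x)) for x in xs)
print("max abs difference:", max_err)
```

The grid covers both negative and positive inputs, which is where fmod and remainder differ. Note that at tmp == 0 this backward returns -d_output, so it only matches the slope of the function away from the kinks.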
Then I tested the speed of each op and got:

I am confused:

  1. Why is the backward pass even SLOWER than in the original implementation?
  2. Why is torch.sin as fast as relu?

Looking forward to somebody’s help~

I guess you’re testing with a large tensor and running into memory stalls (NVIDIA’s profiler can report this). There may also be some overhead from attaching a “custom” backward to autograd, but I’m not sure about that.
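To make the memory-stall idea a bit more concrete, here is a rough back-of-envelope sketch. It assumes every elementwise kernel is purely bandwidth-bound and touches global memory once per input element and once per output element. The tensor size and the 500 GB/s bandwidth are made-up illustrative numbers, not measurements, and `traffic_ms` is a hypothetical helper.

```python
BYTES_PER_ELEM = 4      # float32 (assumption)
N = 64 * 1024 * 1024    # hypothetical tensor size: 64M elements
BANDWIDTH = 500e9       # hypothetical GPU memory bandwidth in bytes/s

def traffic_ms(reads, writes, n=N):
    """Estimated time in ms for a kernel that moves (reads + writes) * n elements."""
    return (reads + writes) * n * BYTES_PER_ELEM / BANDWIDTH * 1e3

# Any single unary elementwise op (sin, relu, abs, ...): 1 read + 1 write per element.
unary_ms = traffic_ms(1, 1)

# Unfused forward: five such ops back to back.
unfused_forward_ms = 5 * unary_ms

# Fused forward (assumed single kernel): read x, write the cached tmp and the output.
fused_forward_ms = traffic_ms(1, 2)

# Fused backward: read tmp and d_output, write the input gradient.
fused_backward_ms = traffic_ms(2, 1)

for name, ms in [("single unary op", unary_ms),
                 ("unfused forward", unfused_forward_ms),
                 ("fused forward", fused_forward_ms),
                 ("fused backward", fused_backward_ms)]:
    print(f"{name:16s} {ms:6.2f} ms")
```

Under these assumptions sin and relu cost the same, because both are limited by the one read and one write rather than by the math. The fused backward still has to move three tensors' worth of data, roughly what a single built-in binary elementwise op moves, so fusion has much less to gain there than in the forward.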