How to use cpu_kernel_vec

Hello,

I am trying to write a custom quantized operator. This is my first dive into the C++ code base, and I am having a hard time understanding how cpu_kernel_vec from Loops.h works.

Would you mind helping me with this example?

AT_DISPATCH_QINT_TYPES(out.scalar_type(), "qmul", [&]() {
    using Vec = Vec256<scalar_t>;
    cpu_kernel_vec(
        iter,
        [&](scalar_t a, scalar_t b) -> scalar_t {
          int32_t a_sub_z = static_cast<int32_t>(a.val_) -
              static_cast<int32_t>(self_zero_point);
          int32_t b_sub_z = static_cast<int32_t>(b.val_) -
              static_cast<int32_t>(other_zero_point);
          int32_t c = a_sub_z * b_sub_z;
          scalar_t res = at::native::requantize_from_int<scalar_t>(
              multiplier, zero_point, c);
          if (ReLUFused) {
            res.val_ = std::max<scalar_t::underlying>(res.val_, zero_point);
          }
          return res;
        },
        [&](Vec a, Vec b) -> Vec {
          Vec::int_vec_return_type a_sub_zp =
              a.widening_subtract(Vec(static_cast<scalar_t>(self_zero_point)));
          Vec::int_vec_return_type b_sub_zp =
              b.widening_subtract(Vec(static_cast<scalar_t>(other_zero_point)));
          Vec::int_vec_return_type c;
          for (int i = 0; i < Vec::int_num_vecs(); ++i) {
            c[i] = a_sub_zp[i] * b_sub_zp[i];
          }
          Vec rv = Vec::requantize_from_int(c, multiplier, zero_point);
          if (ReLUFused) {
            rv = rv.maximum(Vec(static_cast<scalar_t>(zero_point)));
          }
          return rv;
        });
  });

So here I can see that cpu_kernel_vec takes

  1. iter, the data to process,
  2. a function that processes scalar values,
  3. a function that processes vectors (sketched below for a plain float op).
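
For a non-quantized op I would expect the same pattern to look roughly like this. This is only a minimal sketch on my part, so the exact headers, namespaces and the Vec256 spelling are assumptions that may differ between versions:

#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>
#include <ATen/cpu/vec256/vec256.h>

void my_add_kernel(at::TensorIterator& iter) {
  using Vec = at::vec256::Vec256<float>;
  at::native::cpu_kernel_vec(
      iter,
      // #1: scalar lambda
      [](float a, float b) -> float { return a + b; },
      // #2: vectorized lambda
      [](Vec a, Vec b) -> Vec { return a + b; });
}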

I have a hard time understanding when the first one is called. Does it override the Vec multiplication operator? If so, how does it know which operator to override?
Confused :confused:

@tom
Hello Thomas,
maybe you can guide me?

Function #2 processes AVX2 chunks (e.g. blocks of 8 qint32 values). Function #1 is still needed to handle incomplete blocks or non-contiguous tensors; Loops.h has the logic to call it as needed.
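
Conceptually, for one contiguous run of n elements, the loop it generates boils down to something like this. Purely illustrative pseudocode, not the actual Loops.h implementation; op/vop stand for your lambdas #1/#2 and the pointer names are made up:

int64_t i = 0;
// full Vec-sized blocks go through the vectorized lambda (#2)
for (; i + Vec::size() <= n; i += Vec::size()) {
  Vec a = Vec::loadu(a_ptr + i);
  Vec b = Vec::loadu(b_ptr + i);
  vop(a, b).store(out_ptr + i);
}
// the leftover tail goes through the scalar lambda (#1)
for (; i < n; ++i) {
  out_ptr[i] = op(a_ptr[i], b_ptr[i]);
}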

Thank you!

So do we just need to specify in function #1 what should be done with scalars and in function #2 what should be done with vectors, and not worry about the conditions under which they’ll be used?

Yeah, that would be one of the approaches. You don’t have to use cpu_kernel_vec: there is also cpu_kernel, or you can dispatch to an arbitrary C++ template function (e.g. CUDA operators look a bit different, though built-in ones often use a similar gpu_kernel function).
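
If it helps, cpu_kernel only takes the scalar lambda, so it is the simpler starting point. Again just a sketch, with the same caveat that the exact signatures may differ between versions:

at::native::cpu_kernel(iter, [](float a, float b) -> float {
  return a + b;
});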