PyTorch custom CUDA extension - no for loop needed?

I’m following the official tutorial on writing a custom CUDA extension, and I noticed this example CUDA kernel:

template <typename scalar_t>
__global__ void lltm_cuda_forward_kernel(
    const scalar_t* __restrict__ gates,
    const scalar_t* __restrict__ old_cell,
    scalar_t* __restrict__ new_h,
    scalar_t* __restrict__ new_cell,
    scalar_t* __restrict__ input_gate,
    scalar_t* __restrict__ output_gate,
    scalar_t* __restrict__ candidate_cell,
    size_t state_size) {
  const int column = blockIdx.x * blockDim.x + threadIdx.x;
  const int index = blockIdx.y * state_size + column;
  const int gates_row = blockIdx.y * (state_size * 3);
  if (column < state_size) {
    input_gate[index] = sigmoid(gates[gates_row + column]);
    output_gate[index] = sigmoid(gates[gates_row + state_size + column]);
    candidate_cell[index] = elu(gates[gates_row + 2 * state_size + column]);
    new_cell[index] =
        old_cell[index] + candidate_cell[index] * input_gate[index];
    new_h[index] = tanh(new_cell[index]) * output_gate[index];
  }
}

In other CUDA examples, the kernel body is usually wrapped in a for loop. For example:

  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    // (...)
  }

I have two questions:

  1. Why is it not used (or not necessary) in the kernel above?
  2. Is such a for loop necessary if I want to create something simpler, e.g. a custom activation function applied to each tensor element?

Best regards,

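It depends on your use case and on whether you are writing a “monolithic kernel” or a “grid-stride loop”.
A monolithic kernel assumes the launch provides at least one thread per element, so each thread handles exactly one element and only needs a bounds check such as `if (column < state_size)`. The kernel you posted follows this pattern. A grid-stride loop lets a grid of any size cover the whole array, because each thread advances by the total number of threads in the grid (`blockDim.x * gridDim.x`) until it runs past the end.
For a simple element-wise activation function either approach works; the loop is only required if the grid may be smaller than the number of elements.
This tutorial explains it using the saxpy example.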
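To make the difference concrete, here is a minimal sketch of both patterns applied to an element-wise activation. The `leaky_relu` function, the kernel names, and the launch sizes (256 threads per block, 64 blocks for the grid-stride version) are assumptions chosen for illustration and do not come from the tutorial. It uses plain `float` and unified memory to stay short, rather than the `scalar_t` template and tensor accessors a real PyTorch extension would use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise activation, used only for illustration.
__device__ __forceinline__ float leaky_relu(float x) {
  return x > 0.0f ? x : 0.01f * x;
}

// Monolithic kernel: one thread per element. The launch must supply at
// least `size` threads; the bounds check guards the last, partial block.
__global__ void activation_monolithic(const float* __restrict__ in,
                                      float* __restrict__ out,
                                      size_t size) {
  const size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < size) {
    out[i] = leaky_relu(in[i]);
  }
}

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid, so any grid size covers the array.
__global__ void activation_grid_stride(const float* __restrict__ in,
                                       float* __restrict__ out,
                                       size_t size) {
  const size_t stride = static_cast<size_t>(blockDim.x) * gridDim.x;
  for (size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
       i < size; i += stride) {
    out[i] = leaky_relu(in[i]);
  }
}

int main() {
  const size_t size = 1 << 20;
  float *in = nullptr, *out_mono = nullptr, *out_stride = nullptr;
  cudaMallocManaged(&in, size * sizeof(float));
  cudaMallocManaged(&out_mono, size * sizeof(float));
  cudaMallocManaged(&out_stride, size * sizeof(float));
  for (size_t i = 0; i < size; ++i) {
    in[i] = static_cast<float>(i) - 0.5f * size;  // mix of negative and positive inputs
  }

  const int threads = 256;
  // Monolithic launch: round up so that blocks * threads >= size.
  const int blocks = static_cast<int>((size + threads - 1) / threads);
  activation_monolithic<<<blocks, threads>>>(in, out_mono, size);
  // Grid-stride launch: a fixed, smaller grid (64 blocks is an arbitrary choice).
  activation_grid_stride<<<64, threads>>>(in, out_stride, size);

  if (cudaDeviceSynchronize() != cudaSuccess) {
    printf("kernel launch or execution failed\n");
    return 1;
  }

  // Both kernels apply the same function, so their outputs should agree.
  size_t mismatches = 0;
  for (size_t i = 0; i < size; ++i) {
    if (out_mono[i] != out_stride[i]) ++mismatches;
  }
  printf("mismatches: %zu\n", mismatches);

  cudaFree(in);
  cudaFree(out_mono);
  cudaFree(out_stride);
  return mismatches == 0 ? 0 : 1;
}
```

If the launch code always rounds the block count up as in the monolithic case, the loop adds nothing. The grid-stride form becomes useful when you want to cap the grid size or reuse the same kernel for inputs larger than the grid.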

Thank you for the reply. This does answer my question.