Clarification on at::parallel_for needed

On slide 43 of this document, I read that it is recommended to use at::parallel_for instead of OpenMP pragmas.

In another post here, the individual elements of the tensor are accessed via operator[], e.g.

torch::Tensor z_out = at::empty({z.size(0), z.size(1)}, z.options());
int64_t batch_size = z.size(0);

at::parallel_for(0, batch_size, 0, [&](int64_t start, int64_t end) {
  for (int64_t b = start; b < end; b++) {
    z_out[b] = z[b] * z[b];
  }
});

Is this the right way to do it, or should one still use a tensor accessor (even when using at::parallel_for)?
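For reference, here is my own sketch of what I assume the accessor variant would look like (the function name square_rows is mine, and I am assuming z is a contiguous 2-D float CPU tensor):

#include <torch/torch.h>
#include <ATen/Parallel.h>

torch::Tensor square_rows(const torch::Tensor& z) {
  torch::Tensor z_out = at::empty({z.size(0), z.size(1)}, z.options());
  // Accessors give direct element access without going through operator[],
  // which constructs a new Tensor object on every indexing call.
  auto z_acc = z.accessor<float, 2>();
  auto out_acc = z_out.accessor<float, 2>();
  const int64_t batch_size = z.size(0);
  const int64_t row_size = z.size(1);

  at::parallel_for(0, batch_size, 0, [&](int64_t start, int64_t end) {
    for (int64_t b = start; b < end; b++) {
      for (int64_t i = 0; i < row_size; i++) {
        out_acc[b][i] = z_acc[b][i] * z_acc[b][i];
      }
    }
  });
  return z_out;
}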

Thanks! Let me ask two follow-up questions:

  1. Is there a recommended way to implement nested loops with at::parallel_for? With OpenMP you can (if the logic of the loops allows for it) use the collapse(2) clause; see the sketch after this list for the kind of pattern I mean.
  2. Is there a similar approach for tensors residing in GPU memory, or do I need to implement a CUDA kernel?
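To make question 1 concrete, here is the kind of pattern I have in mind, again just my own sketch under the same 2-D float assumption (the function names and the flattened-index idea are my guesses, not something I found in the docs):

#include <ATen/ATen.h>
#include <ATen/Parallel.h>

// OpenMP: the two perfectly nested loops are merged into one iteration space.
void square_omp(const at::Tensor& z, at::Tensor& z_out) {
  auto z_acc = z.accessor<float, 2>();
  auto out_acc = z_out.accessor<float, 2>();
  const int64_t B = z.size(0), N = z.size(1);
  #pragma omp parallel for collapse(2)
  for (int64_t b = 0; b < B; b++) {
    for (int64_t i = 0; i < N; i++) {
      out_acc[b][i] = z_acc[b][i] * z_acc[b][i];
    }
  }
}

// My naive guess for at::parallel_for: flatten (b, i) into a single index
// range by hand and split it back inside the loop body.
void square_parallel_for(const at::Tensor& z, at::Tensor& z_out) {
  auto z_acc = z.accessor<float, 2>();
  auto out_acc = z_out.accessor<float, 2>();
  const int64_t B = z.size(0), N = z.size(1);
  at::parallel_for(0, B * N, 0, [&](int64_t start, int64_t end) {
    for (int64_t k = start; k < end; k++) {
      const int64_t b = k / N;  // recover the outer index
      const int64_t i = k % N;  // recover the inner index
      out_acc[b][i] = z_acc[b][i] * z_acc[b][i];
    }
  });
}

Is flattening the index range by hand like this the intended approach, or is there a better way?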

Many thanks in advance.