For a CUDA extension implementing an element-wise function, what is the best launch shape for the kernel?

I’m writing a PyTorch CUDA extension that implements an element-wise function.
Assume we don’t know whether the input is 2D, 3D, or 4D, only that its first dimension is the batch dimension. How do we choose the best kernel launch shape (grid size and block size)?

Thank you!

I found a similar question with an answer, which may be the solution.
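
For anyone landing here, this is a minimal sketch of one common approach, and not necessarily what that answer proposes. An element-wise op does not care about the tensor's shape, so a contiguous input can be treated as a flat array of `numel()` elements, launched on a 1D grid, with each thread walking a grid-stride loop. The helper names, the 256-thread block, and the 65535-block cap are my own illustrative assumptions, not PyTorch or CUDA APIs. The loop is emulated on the CPU in plain C++ so the indexing can be checked without a GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not part of PyTorch or CUDA): a 1D launch shape.
struct LaunchConfig {
    unsigned int grid;   // number of blocks, i.e. gridDim.x
    unsigned int block;  // threads per block, i.e. blockDim.x
};

// Pick a 1D launch shape for `numel` elements, independent of tensor rank.
// threads_per_block and max_blocks are assumed tuning knobs, not fixed rules.
inline LaunchConfig elementwise_launch_config(std::int64_t numel,
                                              unsigned int threads_per_block = 256,
                                              unsigned int max_blocks = 65535) {
    if (numel <= 0) {
        return {0u, threads_per_block};  // nothing to do; the caller should skip the launch
    }
    // Ceiling division: enough blocks to cover every element once...
    const std::int64_t needed = (numel + threads_per_block - 1) / threads_per_block;
    // ...capped, so that large inputs are handled by the grid-stride loop instead.
    const std::int64_t capped = std::min<std::int64_t>(needed, max_blocks);
    return {static_cast<unsigned int>(capped), threads_per_block};
}

// CPU emulation of the grid-stride loop a kernel would run:
//   for (i = blockIdx.x * blockDim.x + threadIdx.x; i < numel;
//        i += gridDim.x * blockDim.x) { out[i] = f(in[i]); }
// Returns true if every index in [0, numel) is visited exactly once.
inline bool all_visited_once(std::int64_t numel, LaunchConfig cfg) {
    std::vector<int> visits(static_cast<std::size_t>(std::max<std::int64_t>(numel, 0)), 0);
    const std::int64_t stride = static_cast<std::int64_t>(cfg.grid) * cfg.block;
    for (std::int64_t tid = 0; tid < stride; ++tid) {         // one iteration per GPU thread
        for (std::int64_t i = tid; i < numel; i += stride) {  // grid-stride loop
            ++visits[static_cast<std::size_t>(i)];
        }
    }
    return std::all_of(visits.begin(), visits.end(), [](int v) { return v == 1; });
}
```

In the extension itself, the launch would then be along the lines of `my_kernel<<<cfg.grid, cfg.block>>>(...)` with `cfg = elementwise_launch_config(input.numel())`, where `my_kernel` is a placeholder name. The best block size still depends on the GPU and on the kernel's register use, so it is worth benchmarking a few values such as 128, 256, and 512.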