Different implementations of the square function

You can use any of these, as they should all dispatch to the same kernel and therefore give the same result.
A similar question was also asked here.
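
If you want to convince yourself that the variants agree, a quick check along these lines might help. This is only a sketch: the thread does not show which spellings were being compared, so `x * x`, `x ** 2`, and `pow(x, 2)` are assumed here, and plain Python floats stand in for whatever tensor type you are actually using. The helper names `square_variants` and `variants_agree` are made up for illustration.

```python
import math


def square_variants(x):
    """Return the square of x computed with three assumed spellings."""
    return {
        "x * x": x * x,
        "x ** 2": x ** 2,
        "pow(x, 2)": pow(x, 2),
    }


def variants_agree(x, rel_tol=1e-12):
    """Check that every spelling matches the first one within a tolerance."""
    values = list(square_variants(x).values())
    return all(math.isclose(v, values[0], rel_tol=rel_tol) for v in values)


if __name__ == "__main__":
    # Print each sample, the three results, and whether they agree.
    for sample in [0.0, 1.5, -2.0, 3.25]:
        print(sample, square_variants(sample), variants_agree(sample))
```

A tolerance is used instead of `==` because different spellings can in principle take different floating-point paths. With a tensor library you would swap in its own comparison helper, but the idea is the same.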