How can I write CUDA code that supports FP16 calculation?

This code may help.
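As a minimal sketch (not a definitive implementation), here is a half-precision `y = a * x + y` using the `__half` type and intrinsics from `cuda_fp16.h`. The kernel name `haxpy`, the host helper `run_haxpy`, and the test values are all made up for illustration. Native FP16 arithmetic such as `__hfma` needs compute capability 5.3 or newer, so the kernel falls back to float math on older targets.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_fp16.h>   // __half type and the FP16 conversion/arithmetic intrinsics

// y[i] = a * x[i] + y[i], stored in half precision.
__global__ void haxpy(int n, __half a, const __half* x, __half* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if __CUDA_ARCH__ >= 530
        // Native FP16 fused multiply-add (compute capability 5.3+).
        y[i] = __hfma(a, x[i], y[i]);
#else
        // Fallback: compute in float, then round the result back to half.
        y[i] = __float2half(__half2float(a) * __half2float(x[i]) + __half2float(y[i]));
#endif
    }
}

// Hypothetical host helper: takes float inputs, runs the kernel in FP16,
// and returns the results converted back to float. Error checking is
// omitted to keep the sketch short.
std::vector<float> run_haxpy(float a, const std::vector<float>& x,
                             const std::vector<float>& y)
{
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(__half);

    // Convert the inputs to half on the host.
    std::vector<__half> hx(n), hy(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = __float2half(x[i]);
        hy[i] = __float2half(y[i]);
    }

    __half *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    haxpy<<<blocks, threads>>>(n, __float2half(a), dx, dy);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) out[i] = __half2float(hy[i]);
    return out;
}

int main()
{
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> y = {0.5f, 0.5f, 0.5f, 0.5f};
    std::vector<float> r = run_haxpy(2.0f, x, y);
    for (float v : r) printf("%f\n", v);
    return 0;
}
```

Build it for an architecture with FP16 support, for example `nvcc -arch=sm_70 haxpy.cu -o haxpy` (the file name is just a placeholder). If you need more throughput, the same header also provides `__half2`, which packs two half values so intrinsics like `__hfma2` process both at once.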