How can I write CUDA code to support FP16 calculation?

I want to write a custom layer using CUDA. However, it fails when I use NVIDIA Apex to train the model with mixed precision. What should I do? Is there any example of an FP16 CUDA layer?

You could use AT_DISPATCH_FLOATING_TYPES_AND_HALF to dispatch the kernel for the float16 dtype as well and use scalar_t inside the kernel code (similar to e.g. this code).
Also note that we now recommend using the native mixed-precision training utility via torch.cuda.amp instead of apex/amp.
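
As a rough sketch of what that dispatch pattern looks like, here is a minimal, hypothetical elementwise "square" extension; the names (square_kernel, square_cuda_forward) are made up for illustration, but the AT_DISPATCH_FLOATING_TYPES_AND_HALF / scalar_t structure is the part that makes the kernel compile for float, double, and half:

```cpp
// square_cuda.cu -- minimal sketch of an FP16-capable CUDA extension.
// The layer itself (elementwise square) and its names are hypothetical.
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Kernel templated on scalar_t, so the same code is instantiated for
// float, double, and at::Half.
template <typename scalar_t>
__global__ void square_kernel(const scalar_t* __restrict__ input,
                              scalar_t* __restrict__ output,
                              int64_t numel) {
  const int64_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < numel) {
    output[idx] = input[idx] * input[idx];
  }
}

torch::Tensor square_cuda_forward(torch::Tensor input) {
  auto output = torch::empty_like(input);
  const int64_t numel = input.numel();
  const int threads = 256;
  const int blocks = static_cast<int>((numel + threads - 1) / threads);

  // AT_DISPATCH_FLOATING_TYPES_AND_HALF adds a case for half on top of
  // float/double; inside the lambda, scalar_t is the concrete dtype.
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(
      input.scalar_type(), "square_cuda_forward", ([&] {
        square_kernel<scalar_t><<<blocks, threads>>>(
            input.data_ptr<scalar_t>(),
            output.data_ptr<scalar_t>(),
            numel);
      }));
  return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &square_cuda_forward, "Square forward (CUDA)");
}
```

With the half case dispatched like this, the layer can accept the float16 tensors that torch.cuda.amp (or apex) feeds it during mixed-precision training.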

These examples may help.