How to write a C extension operation with both CPU/GPU and float/double versions

How do I write a C extension operation that has both CPU/GPU and float/double versions?

There are several examples:

  1. custom-cuda
  2. extension-cffi
  3. audio
  4. pytorch-ctc

There are also some similar questions on the forum that you can look at.
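
For the float/double part, one common pattern in C is to generate each typed variant from a single macro, so the logic is written only once. The snippet below is only a rough sketch of that idea and is not taken from any of the examples above: the macro `DEFINE_SCALE_ADD` and the functions `scale_add_float` / `scale_add_double` are hypothetical names, and it covers the CPU side only. A GPU version would presumably live in a separate CUDA source file and follow the same naming scheme.

```c
#include <stddef.h>

/*
 * Hypothetical helper macro (not part of any library): stamps out one CPU
 * implementation per element type. Each generated function computes
 * out[i] = alpha * x[i] + y[i] for i in [0, n) and returns 1 on success,
 * or 0 if any pointer is NULL.
 */
#define DEFINE_SCALE_ADD(SUFFIX, REAL)                               \
    int scale_add_##SUFFIX(const REAL *x, const REAL *y, REAL *out,  \
                           size_t n, REAL alpha)                     \
    {                                                                \
        size_t i;                                                    \
        if (!x || !y || !out) return 0;                              \
        for (i = 0; i < n; ++i)                                      \
            out[i] = alpha * x[i] + y[i];                            \
        return 1;                                                    \
    }

/* Expands to scale_add_float(...) and scale_add_double(...). */
DEFINE_SCALE_ADD(float, float)
DEFINE_SCALE_ADD(double, double)
```

The caller (for example the Python wrapper around the extension) would then pick `scale_add_float` or `scale_add_double` based on the tensor's element type, and a CPU or GPU variant based on where the tensor lives.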