Is there any example of a C++/CUDA extension that supports mixed precision training and torch.compile? I'm not sure how to make my CUDA kernel support them.
I think it's best to use the ATen dispatch macros so your CUDA kernel or C++ function works with any floating-point type, in particular the 16-bit ones (half/bfloat16) that mixed precision uses. Note that plain AT_DISPATCH_FLOATING_TYPES only covers float and double, so for fp16/bf16 you want the AT_DISPATCH_FLOATING_TYPES_AND_HALF or AT_DISPATCH_FLOATING_TYPES_AND2 variants. Here is a call example:
// Dispatch on the tensor's runtime dtype. The _AND2 variant also generates
// half and bfloat16 instantiations, which plain AT_DISPATCH_FLOATING_TYPES
// (float/double only) would not, so the kernel keeps working under autocast.
AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16,
    feats.scalar_type(), "trilinear_fw_cu",
    ([&] {
        trilinear_fw_kernel<scalar_t><<<blocks, threads>>>(
            feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            feat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>()
        );
    }));
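For the torch.compile part, the cleanest route is to expose the C++ function as a real custom op via TORCH_LIBRARY and give it a Meta implementation, so the compiler can infer output shapes without launching the CUDA kernel. This is only a sketch under assumptions: I'm assuming a wrapper trilinear_fw_cu(feats, points) that contains the dispatch call above, an op namespace my_ops, and a (N, 8, F) feats / (N, 3) points -> (N, F) output convention; adjust the schema and the meta shape to your actual kernel.

#include <torch/extension.h>
#include <torch/library.h>

// CUDA wrapper holding the AT_DISPATCH call shown above (assumed signature).
torch::Tensor trilinear_fw_cu(torch::Tensor feats, torch::Tensor points);

// Meta kernel: only reports output shape/dtype, never touches GPU memory.
// The (N, F) output shape is an assumption about this particular kernel.
torch::Tensor trilinear_fw_meta(torch::Tensor feats, torch::Tensor points) {
    return torch::empty({feats.size(0), feats.size(2)}, feats.options());
}

// "my_ops" is a placeholder namespace; pick your own.
TORCH_LIBRARY(my_ops, m) {
    m.def("trilinear_fw(Tensor feats, Tensor points) -> Tensor");
}

TORCH_LIBRARY_IMPL(my_ops, CUDA, m) {
    m.impl("trilinear_fw", &trilinear_fw_cu);
}

TORCH_LIBRARY_IMPL(my_ops, Meta, m) {
    m.impl("trilinear_fw", &trilinear_fw_meta);
}

From Python you then call torch.ops.my_ops.trilinear_fw(feats, points) inside your autograd.Function or module, and torch.compile can trace through it. Under torch.autocast the inputs may arrive as half/bfloat16, which the _AND2 dispatch above handles; if feats and points can end up with different dtypes, cast them to a common dtype in the wrapper first. On newer PyTorch versions you can also register the fake/meta implementation from Python with torch.library.register_fake instead of the C++ Meta kernel.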