Embedding kernel for DLRM

Is there interest in a version of the DLRM embedding kernel optimized for Hopper GPUs?

I wanted to check before opening a PR. The kernel uses some Hopper-specific features to improve performance.