Efficient matmul in fused kernels

I am looking to fuse a dequantization step into a matmul operation. I could always run one kernel and then the other, but I would really like to have both in the same kernel so that the dequantized weights never need to be stored in global memory.
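
Here is a minimal sketch of the kind of fusion I have in mind. It assumes int8 weights with a per-column scale, which is just a stand-in for my actual quantization scheme; the kernel name, `TILE` size, and layouts are all placeholders. Each block dequantizes a tile of weights directly into shared memory, so the full-precision weights only ever exist on-chip:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

#define TILE 16  // launched with a dim3(TILE, TILE) block, grid covering C

// C = A * dequant(Bq): A is M x K (float, row-major), Bq is K x N (int8,
// row-major) with a per-column scale; C is M x N. All names hypothetical.
__global__ void fused_dequant_matmul(const float* A,
                                     const int8_t* Bq,
                                     const float* scale,
                                     float* C,
                                     int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];  // dequantized tile lives only in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;

        // Dequantize on load: int8 -> float, scaled per output column.
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? (float)Bq[bRow * N + col] * scale[col]
                                  : 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```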

Is there a way to take an existing matmul implementation and run it from inside my own kernel? Or alternatively, to write the dequantization as a preprocessing step that gets inlined into the CUDA kernel (sketched below)?
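
By "inlined preprocessing step" I mean something along these lines: a small `__device__` helper that sits in the matmul kernel's load path, so the compiler folds the dequantization into the load itself. Again just a sketch with the same hypothetical int8-plus-scale scheme:

```cuda
#include <cstdint>

// Hypothetical dequantization helper; __forceinline__ asks the compiler to
// fold it into the kernel's global-memory load path.
__forceinline__ __device__ float dequant(int8_t q, float scale)
{
    return static_cast<float>(q) * scale;
}

// Inside the matmul kernel, each load of a quantized weight would become:
//   float w = dequant(Bq[idx], scale[col]);
// so the full-precision value only ever exists in registers or shared memory.
```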