Triton kernel to efficiently dequantize int4

We would like to quantize Llama 3.3 70B with int4 weight quantization, analogous to what was done in the gpt-fast repo.

Is there a fast Triton-based dequantization code sample we could use to speed up the forward-pass implementation?

Are there any Triton dequantization code examples we could use for experimentation?
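For context on what such a kernel would need to do: a minimal pure-Python sketch of a common int4 scheme is below. This is an assumption about the layout (two 4-bit codes packed per byte, low nibble first, with a scale and zero point, dequant = (q - zero_point) * scale), not the exact format gpt-fast uses; a Triton kernel would do the same unpacking with bitwise ops on loaded tiles.

```python
# Reference (CPU) sketch of an int4 quantization scheme.
# ASSUMED layout (not taken from gpt-fast): two 4-bit codes per byte,
# low nibble first, dequantized as (q - zero_point) * scale.

def pack_int4(codes):
    """Pack a list of ints in [0, 15] into bytes, two codes per byte."""
    assert len(codes) % 2 == 0
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )

def dequantize_int4(packed, scale, zero_point):
    """Unpack each byte into two 4-bit codes and map them back to floats."""
    out = []
    for b in packed:
        for q in (b & 0xF, b >> 4):  # low nibble, then high nibble
            out.append((q - zero_point) * scale)
    return out

# Round-trip a small example.
codes = [0, 8, 15, 4]
packed = pack_int4(codes)  # 2 bytes
print(dequantize_int4(packed, scale=0.5, zero_point=8))  # [-4.0, 0.0, 3.5, -2.0]
```

In a Triton version, the same `& 0xF` / `>> 4` unpacking would run on int8 tiles loaded with `tl.load`, typically fused into the matmul so the full-precision weights never hit global memory.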

Or would torch.compile handle this for us automatically?

Thanks!