Lowering exir to quantized operations (no delegate)

Hi,

I'm working with ExecuTorch and torch.ao to enable quantization-aware training (QAT) in our workflow, and possibly to lower a quantized implementation through ExecuTorch without a delegate. I wrote a Quantizer that can be configured to match multiple qschemes, and I get a properly annotated graph that can be used for QAT. After pt2e conversion I get a model with inserted quantize/dequantize nodes around the floating-point ATen operators, which I can convert to edge.

PyTorch has a set of quantized ATen operators (pytorch/aten/src/ATen/native/quantized at main · pytorch/pytorch · GitHub), and I'd like to be able to use them to run model inference. Looking through the ExecuTorch code, there is a mention of `replace_quantized_partition_with_op` in `exir/backend/utils`, but I don't see it used anywhere in the repository. Is there any way to replace dq/op/q partitions in the exir graph with exir/ATen quantized operators without using a delegate? In `executorch/kernels/aten/functions.yaml` no quantized ATen operators are registered, and the quantized folder only registers a very limited set of operations (mainly quantize/dequantize ops). Is there any plan to support quantized operators in the runtime without a delegate?
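For anyone wanting to prototype the dq/op/q replacement idea outside ExecuTorch's internals, torch.fx's subgraph rewriter can express this kind of pattern substitution. The sketch below is illustrative only: it uses a toy `fake_quant_dequant` stand-in and a hypothetical "fused" replacement, not the real `torch.ops.quantized_decomposed.*` nodes that pt2e conversion inserts, and it says nothing about whether the resulting ops would have runtime kernels.

```python
import torch
from torch.fx import symbolic_trace, subgraph_rewriter

# Toy stand-in for a quantize -> dequantize round trip. A real pass would
# match the torch.ops.quantized_decomposed.* nodes inserted by pt2e
# conversion; this simplified version just illustrates the mechanics.
def fake_quant_dequant(x, scale):
    return torch.round(x / scale) * scale

class M(torch.nn.Module):
    def forward(self, x, scale):
        # dq/op/q-style pattern: (de)quantize, then a float op
        return torch.relu(fake_quant_dequant(x, scale))

def pattern(x, scale):
    return torch.relu(fake_quant_dequant(x, scale))

def replacement(x, scale):
    # Hypothetical "fused quantized relu": run the op, then requantize.
    # In a real pass this would be a call to a quantized operator.
    return fake_quant_dequant(torch.relu(x), scale)

gm = symbolic_trace(M())
matches = subgraph_rewriter.replace_pattern(gm, pattern, replacement)
```

After the rewrite, `gm` computes the replacement graph in place of every matched dq/op/q region; `matches` lists the matched subgraphs.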

I believe ATen mode is not supported in OSS (and it’s not supported on mobile/embedded devices anyway AFAIK), so I don’t think this would help you even if it existed.

> no delegate

Can you say more about why XNNPACK is not suitable for your use case?

> is there any plan to have quantized operators supported in the runtime with no delegate?

@mergennachin / @manuelcandales do either of you know if this is in the plan?