Compiler support for aarch64 CPU inference

Hi, I’m wondering how complete Dynamo/Inductor support currently is for aarch64. The compiler seems to have good coverage for x86 but not for aarch64.

For example, if I take the Hugging Face “bert-base-uncased” model and compile it with default options, I get 2.5x worse latency on aarch64 (25ms) than on x86 (10ms). To be more specific, I’m comparing the c6i (Intel Xeon) and c7g (ARMv8.4-a) instance types on AWS EC2.
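For reference, here is a minimal sketch of how such a comparison might be run. The model name is from the post, but the input text, iteration counts, and timing setup are assumptions; the actual benchmark may differ.

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Compile with default options, as in the comparison above.
compiled = torch.compile(model)

inputs = tokenizer("Hello, world!", return_tensors="pt")

with torch.no_grad():
    # Warm-up runs so compilation happens before timing starts.
    for _ in range(10):
        compiled(**inputs)

    start = time.perf_counter()
    for _ in range(100):
        compiled(**inputs)
    latency_ms = (time.perf_counter() - start) / 100 * 1e3

print(f"average latency: {latency_ms:.1f} ms")
```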

So my questions are: are there any near-term plans for aarch64 support/optimization in Dynamo/Inductor? Is there a roadmap so far?

Thanks!

Hi @yd2102, please check the PR I pushed today: it enables the mkldnn passes for aarch64 when PyTorch is compiled with the ACL (Arm Compute Library) backend. I have observed up to a 5.8x performance improvement on an AWS c7g instance for bert-base-uncased inference with torch.compile().
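As a quick sanity check, you can confirm that your build has the prerequisites for these passes. This is a sketch, assuming the ACL backend shows up in the build configuration string (the exact entry name may vary by build):

```python
import torch

# The Inductor mkldnn passes require oneDNN (mkldnn) to be available.
print(torch.backends.mkldnn.is_available())

# Inspect the build configuration; a build with the ACL backend should
# mention ACL / USE_MKLDNN_ACL in the output.
print(torch.__config__.show())
```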