Does PyTorch/FBGEMM support AVX512 (VNNI) for INT8 Quantization and does it improve performance?

The PyTorch quantization documentation suggests that for efficient quantized inference we need a CPU with AVX2 support or higher.
If we consider transformer-class models trained/quantized and served on x86 architectures using FBGEMM as the quantization engine:

  • Does INT8 quantization using native PyTorch APIs take advantage of the AVX512 instruction set, and if so, in which version of PyTorch was this support introduced?
  • If AVX512 is supported, what latency and throughput impact should we expect when moving from AVX2 to AVX512 with native PyTorch quantization?
  • Are there any benchmarks that already quantify the improvement?
    Notes:
  • By native PyTorch quantization, we mean what is natively supported by PyTorch without exporting to ONNX Runtime (a minimal sketch of this follows below).
  • I see a discussion here on Vec512 and AVX512, and couldn't quite follow
    • A) whether this support has already been introduced, and
    • B) what the actual performance difference is.
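
For concreteness, here is a minimal sketch of what we mean by "native PyTorch quantization": dynamic INT8 quantization of a small transformer-style module with the FBGEMM backend, using only PyTorch APIs and no ONNX Runtime export. The model dimensions and the choice of dynamic (rather than static) quantization are illustrative assumptions, and this assumes an x86 build of PyTorch with FBGEMM available.

```python
import torch
import torch.nn as nn

# Use the x86 INT8 backend (assumes the build ships with FBGEMM).
torch.backends.quantized.engine = "fbgemm"

# A small transformer-style model, purely for illustration.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
).eval()

# Dynamic quantization: Linear weights are converted to INT8 ahead of time,
# activations are quantized on the fly at inference time.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 32, 256)  # (batch, sequence, features)
with torch.no_grad():
    out = qmodel(x)
print(out.shape)
```

The INT8 matrix multiplications inside the quantized Linear modules are the kernels whose AVX2 vs AVX512 (VNNI) behavior the questions above are about.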

It would be great to get some insight here.

Hi @saykarthik, please feel free to take a look at Add AVX512 support in ATen & remove AVX support by imaginary-person · Pull Request #56992 · pytorch/pytorch · GitHub, which is in progress and aims to add this support. Performance numbers are planned to be added to the PR before it is ready to land (they might not be there yet).
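
To sanity-check which SIMD level a given PyTorch build actually dispatches to (useful when comparing AVX2 against AVX512 kernels on the same host), something along these lines can help. Treat this as a sketch rather than a guaranteed interface: the exact text printed by `torch.__config__.show()` and the `ATEN_CPU_CAPABILITY` environment variable used to pin dispatch may vary across versions.

```python
import torch

# Build and dispatch information; on recent x86 builds this includes the
# CPU capability the ATen kernels were compiled for.
print(torch.__config__.show())

# Confirm that the fbgemm quantization engine is available in this build.
print(torch.backends.quantized.supported_engines)

# To benchmark AVX2 vs AVX512 kernels on the same machine, the dispatch
# level can be pinned before launching the process, e.g.:
#   ATEN_CPU_CAPABILITY=avx2 python benchmark.py
#   ATEN_CPU_CAPABILITY=avx512 python benchmark.py
```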
