[ROCm][CI] fp8 acceptable accuracy threshold

Hello,

I observed that in:

the accuracy threshold of that test is rtol=5e-2, atol=0.07. It is not clear to me why this threshold is considered acceptable.

  1. Why can't that threshold be tightened for better accuracy?
  2. Is there a similar test on CPU or another platform that could show which thresholds are used on other hardware?
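For context, here is a minimal sketch of what a combined rtol/atol threshold allows, assuming the test uses the usual closeness criterion documented for `torch.testing.assert_close`, namely `|actual - expected| <= atol + rtol * |expected|`. The helper name `within_tolerance` is hypothetical and exists only for illustration; it is not part of PyTorch.

```python
def within_tolerance(expected: float, actual: float,
                     rtol: float = 5e-2, atol: float = 0.07) -> bool:
    """Return True if `actual` is close to `expected` under the
    combined criterion |actual - expected| <= atol + rtol * |expected|.

    The defaults are the values quoted in this thread; the function
    itself is an illustrative stand-in, not the test's real code.
    """
    allowed = atol + rtol * abs(expected)
    return abs(actual - expected) <= allowed


# Near zero the absolute term dominates: any error up to 0.07 passes.
print(within_tolerance(expected=0.0, actual=0.05))

# For larger values the relative term dominates: about 5% of |expected|.
print(within_tolerance(expected=100.0, actual=104.0))
print(within_tolerance(expected=100.0, actual=106.0))
```

With these values, an element whose reference is 100.0 may deviate by up to 5.07, while an element whose reference is 0.0 may deviate by up to 0.07.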

This question relates to an ongoing investigation of kernels used in vLLM: "Add padding support to wvSplitK solution for skinny GEMMs" by amd-hhashemi (Pull Request #33762, vllm-project/vllm, GitHub)

Can you post this in torch.compile (PyTorch Forums)?

After an internal discussion, I realized that a tolerance of 0.05 for fp8 is perfectly normal. There is also a PyTorch blog post on the topic: "Some Matrix Multiplication Engines Are Not As Accurate As We Thought".
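A rough way to see why a tolerance around 0.05 is plausible: the sketch below rounds values to a format with 3 explicit mantissa bits and measures the worst relative rounding error. It assumes an e4m3-style fp8 format, which this thread does not specify, and it ignores exponent range, subnormals, and saturation. The helper `round_to_mantissa_bits` is hypothetical, written in pure Python for illustration rather than taken from PyTorch.

```python
import math


def round_to_mantissa_bits(x: float, mantissa_bits: int = 3) -> float:
    """Round x to the nearest value representable with `mantissa_bits`
    explicit mantissa bits (ties to even).

    Simplified model: exponent range, subnormals, and saturation of a
    real fp8 format are deliberately ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # significant bits incl. implicit leading 1
    return math.ldexp(round(m * scale) / scale, e)


# Scan one binade [1, 2); the relative error pattern repeats in every binade.
samples = [1.0 + k / 4096 for k in range(4096)]
worst = max(abs(round_to_mantissa_bits(x) - x) / x for x in samples)

print(f"worst relative rounding error: {worst:.4f}")
```

Under these assumptions, a single rounding of one value can already cost close to 6% in relative terms (the worst case is 1/17, reached halfway between 1.0 and 1.125), so a relative tolerance of 5e-2 on an fp8 computation is in the same range as the format's own rounding error.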

This is also not ROCm-specific: there are other cross-platform FP8 tests (including on CUDA), for example pytorch/test/test_scaled_matmul_cuda.py at 4a03a41be77fa58c7e7886509932f30867e8db99 (pytorch/pytorch, GitHub)