[ROCm][CI] fp8 acceptable accuracy threshold

andreasceid · February 6, 2026, 9:11pm

Hello,

I observed that in:

test/inductor/test_fp8.py at 696b6fa86

                "triton.use_tensor_descriptor": False,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "ATEN,TRITON",
                "max_autotune_gemm_search_space": "EXHAUSTIVE",
            }

            with config.patch(patch_cfg):
                compiled = torch.compile(linear, mode="max-autotune")
                actual = compiled(a, b, scale_a, scale_b, scale_r)

            self.assertEqual(expected, actual, rtol=5e-2, atol=0.07)


    class TestFP8Lowering(TestCase):
        @unittest.skipIf(not PLATFORM_SUPPORTS_FP8, f8_msg)
        @parametrize("dtype", (torch.bfloat16, torch.float32))
        @parametrize("shape", ("16,16,32", "16,32,32", "1024,1024,512"))
        @parametrize("has_bias", (False, True))
        @parametrize("use_fast_accum", (False, True))
        @parametrize(
            "persistent_matmul", [False, True] if has_triton_tma_device() else [False]

the accuracy threshold of that test is rtol=5e-2, atol=0.07. It is not clear to me why a tolerance this loose is acceptable.

Why can't that threshold be tightened to require better accuracy?
Is there a similar test for CPU or another platform that could provide some insight into the thresholds used for other hardware?
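
For reference, here is a minimal sketch of what a tolerance pair like rtol=5e-2, atol=0.07 permits. It assumes the usual allclose-style criterion documented for torch.testing.assert_close, |actual - expected| <= atol + rtol * |expected|. The helper names allowed_error and is_close are hypothetical and are not part of PyTorch.

```python
def allowed_error(expected: float, rtol: float = 5e-2, atol: float = 0.07) -> float:
    """Largest |actual - expected| accepted under the assumed allclose-style rule."""
    return atol + rtol * abs(expected)


def is_close(actual: float, expected: float,
             rtol: float = 5e-2, atol: float = 0.07) -> bool:
    """True if actual is within the combined absolute + relative tolerance."""
    return abs(actual - expected) <= allowed_error(expected, rtol, atol)


# The absolute term dominates for small outputs, the relative term for large ones.
for expected in (0.0, 1.0, 10.0, 100.0):
    print(f"expected={expected:>6}: allowed |error| = {allowed_error(expected):.3f}")
```

So for an output element near 1.0, a deviation of up to about 0.12 would pass, while for an element near 100 the relative term alone allows about 5.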

This question is related to an ongoing investigation of kernels used in vLLM: "Add padding support to wvSplitK solution for skinny GEMMs" by amd-hhashemi (Pull Request #33762, vllm-project/vllm, GitHub)

jerryzh168 · February 17, 2026, 11:18pm

Can you post this in the torch.compile category on the PyTorch Forums?

andreasceid · February 20, 2026, 6:35pm

After internal discussion, I realized that a 0.05 tolerance for fp8 is perfectly normal. There is also a PyTorch blog post on this: "Some Matrix Multiplication Engines Are Not As Accurate As We Thought"

This is also not ROCm-specific: there are other cross-platform FP8 tests (including CUDA), for example pytorch/test/test_scaled_matmul_cuda.py at 4a03a41be77fa58c7e7886509932f30867e8db99 (pytorch/pytorch, GitHub)
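
As a rough illustration of why a relative tolerance around 0.05 is natural for fp8, the sketch below rounds values to a format with 3 mantissa bits, as in e4m3. It is a pure-Python approximation written for this thread, not code from the PyTorch test suite: round_to_e4m3 is a hypothetical helper that assumes round-to-nearest-even, saturation at 448 (the largest finite e4m3fn value), and no NaN handling. A real scaled fp8 GEMM also involves scaling factors and higher-precision accumulation, so this only shows the error of a single rounding step.

```python
import math

E4M3_MAX = 448.0          # largest finite value in the e4m3fn format
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal value, 2**-6


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value with 3 mantissa bits (e4m3-like, saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)             # saturate instead of overflowing
    _, e = math.frexp(mag)                  # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)   # exponent of the leading bit (clamped for subnormals)
    quantum = 2.0 ** (exp - 3)              # spacing between representable neighbours
    # Python's round() rounds ties to even, matching round-to-nearest-even here.
    return sign * round(mag / quantum) * quantum


# Worst-case relative rounding error over one binade, sampled on a fine grid.
samples = [1.0 + k / 1024 for k in range(1024)]
max_rel_err = max(abs(round_to_e4m3(x) - x) / x for x in samples)
print(f"max relative rounding error on [1, 2): {max_rel_err:.4f}")
```

With only 3 mantissa bits, the spacing between neighbours in [1, 2) is 0.125, so a single rounding can be off by up to 0.0625 in absolute terms, which is a relative error of roughly 6% at the low end of the binade. A test tolerance of rtol=5e-2 is therefore on the order of one fp8 rounding step rather than an unusually loose bound.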