torch.compile(mode="max-autotune") produces different inference results from eager mode — is this expected?


Hi team,

I’m encountering a result mismatch when running inference using torch.compile.

What I’m doing:

  • I have a PyTorch model defined with standard nn.Module.
# Example input: torch.rand(1, 3, 224, 224)  (float32 by default)

import torch.nn as nn

class BaseConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, conv_layer):
        super().__init__()
        self.conv = conv_layer(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)

    def forward(self, x):
        return self.conv(x)

class ActivatedConv(BaseConv):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, conv_layer, activation):
        super().__init__(in_channels, out_channels, kernel_size, stride, padding, conv_layer)
        self.activation = activation

    def forward(self, x):
        return self.activation(self.conv(x))

class NormalizedConv(ActivatedConv):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, conv_layer, norm, activation):
        super().__init__(in_channels, out_channels, kernel_size, stride, padding, conv_layer, activation)
        self.norm = norm(out_channels)

    def forward(self, x):
        return self.activation(self.norm(self.conv(x)))

class Conv2DBNReLU(NormalizedConv):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super().__init__(in_channels, out_channels, kernel_size, stride, padding, nn.Conv2d, nn.BatchNorm2d, nn.ReLU())

class MyModel(nn.Module):
    def __init__(self, in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv1 = Conv2DBNReLU(in_channels, out_channels, kernel_size, stride, padding)

    def forward(self, x):
        return self.conv1(x)

def my_model_function(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1):
    return MyModel(in_channels, out_channels, kernel_size, stride, padding)

if __name__ == "__main__":
    model = my_model_function()
    print(model)

  • I run inference in two different modes:
    1. Eager mode: output_eager = model(input_tensor)
    2. Compiled mode: output_compiled = torch.compile(model, mode="max-autotune")(input_tensor)
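
Condensed, the comparison looks like this (a sketch assuming CUDA is available; the full reproduction script, with the device fallback, is further down the thread):

import torch

model = my_model_function().cuda().eval()                  # model defined above
input_tensor = torch.rand(1, 3, 224, 224, device="cuda")   # float32 by default

with torch.no_grad():
    output_eager = model(input_tensor)

compiled_model = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    output_compiled = compiled_model(input_tensor)

print(torch.allclose(output_eager, output_compiled, atol=1e-5))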

Problem:

The two outputs are not equal, and the difference exceeds typical tolerances; torch.allclose(output_eager, output_compiled, atol=1e-5) returns False.

This happens consistently on my model when compiled using torch.compile with mode="max-autotune" (and the default backend, "inductor").

=== Detailed comparison ===
Total number of elements: 3,211,264
Max absolute error: 0.00128412
Mean absolute error: 0.000100889
Max relative error: 23,868.7
Mean relative error: 0.285904
Number of elements exceeding tolerance: 98,102
Percentage of out-of-tolerance elements: 3.05%
Result of torch.allclose(output_eager, output_compiled, atol=1e-5): False
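
In case it helps with triage, here is a dtype-aware check as well (not part of my original comparison; a small sketch using torch.testing.assert_close, which applies per-dtype default tolerances and reports the greatest absolute/relative difference on failure):

import torch

def report_mismatch(output_eager, output_compiled):
    # assert_close raises an AssertionError that includes the greatest
    # absolute and relative differences when the tensors are not close.
    try:
        torch.testing.assert_close(output_eager, output_compiled)
        print("Outputs match within default tolerances")
    except AssertionError as err:
        print(err)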

My Question:

  • Is this difference expected due to aggressive optimizations under max-autotune?
  • Or is this possibly a bug or unsupported pattern in my model?

I’d appreciate any clarification, or suggestions on how to debug this further.

Thanks!

Which PyTorch version are you using and do you see the same or similar error in the latest nightly?

Hi, thanks for the follow-up.

Previously I was using PyTorch 2.5.1, and I have just tested it again with the latest nightly build: 2.6.0.dev20241112+cu121.

Unfortunately, the issue still persists. Below is the output difference observed when running the original model compiled with torch.compile(mode="max-autotune"):

=== Output difference details ===

  • Total number of elements: 3,211,264
  • Max absolute error: 0.00102174
  • Mean absolute error: 0.000106659
  • Max relative error: 20,524.3
  • Mean relative error: 0.344893
  • Number of elements exceeding tolerance: 106,797
  • Percentage of elements exceeding tolerance: 3.3257%
  • Result of torch.allclose(output_eager, output_compiled, atol=1e-5): False

Please let me know if I can provide any additional information to help diagnose this further.

Do you have a script that reproduces the problem? That is, what is the shape/dtype/device of your input tensor?

Thanks for the response!

Sure — I’ll prepare and share a minimal script that reproduces the issue, along with full details of the input tensor (shape, dtype, and device). Will update here shortly.

By the way, may I be listed as the original reporter if this turns out to be a confirmed issue?

Thanks again!

import torch
import importlib.util
import os

def load_model_from_file(module_path, model_function_name="my_model_function"):
    # Dynamically import the model definition file and build the model from it.
    module_name = os.path.splitext(os.path.basename(module_path))[0]
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    model_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(model_module)

    model_function = getattr(model_module, model_function_name)
    model = model_function()
    return model



def compare_outputs(a: torch.Tensor, b: torch.Tensor, atol=1e-5, rtol=1e-3):
    print("=== Output difference comparison ===")
    diff = a - b
    abs_diff = diff.abs()
    rel_diff = abs_diff / (a.abs() + 1e-8)  # small epsilon avoids division by zero
    total_elements = a.numel()
    print(f"- Total elements: {total_elements}")
    print(f"- Max absolute error: {abs_diff.max().item():.8f}")
    print(f"- Mean absolute error: {abs_diff.mean().item():.8f}")
    print(f"- Max relative error: {rel_diff.max().item():.8f}")
    print(f"- Mean relative error: {rel_diff.mean().item():.8f}")
    num_exceed = (~torch.isclose(a, b, atol=atol, rtol=rtol)).sum().item()
    print(f"- Elements exceeding tolerance: {num_exceed}")
    print(f"- Percentage exceeding tolerance: {100.0 * num_exceed / total_elements:.4f}%")
    print(f"- torch.allclose: {torch.allclose(a, b, atol=atol, rtol=rtol)}")



if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_tensor = torch.rand(1, 3, 224, 224, device=device)

    model_path = "xxx/xxx/xxx/xxx.py"  # placeholder: path to the model definition file from my first post
    model = load_model_from_file(model_path).to(device).eval()

    with torch.no_grad():
        output_eager = model(input_tensor)

    compiled_model = torch.compile(model, mode="max-autotune")
    with torch.no_grad():
        output_compiled = compiled_model(input_tensor)

    compare_outputs(output_eager, output_compiled)
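
To answer the earlier question directly: the input tensor has shape (1, 3, 224, 224), dtype float32 (the torch.rand default), and lives on CUDA when available, otherwise CPU; the placeholder path above points to the model definition from my first post.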

@tinywisdom this sounds like a bug to me so far. Please feel free to open an issue directly on GitHub.
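
If you want to narrow it down before filing, one thing that may help (a sketch only, assuming your reproduction script above; I have not run it against your model) is to compare eager output against configurations that skip Inductor's autotuned kernels, to see whether the mismatch is specific to max-autotune:

import torch

def triage(model, input_tensor):
    # Compare the eager output against several torch.compile configurations
    # to see where the divergence first appears.
    with torch.no_grad():
        reference = model(input_tensor)
        for label, kwargs in [
            ("aot_eager (no Inductor codegen)", dict(backend="aot_eager")),
            ("inductor, default mode", dict(mode="default")),
            ("inductor, max-autotune", dict(mode="max-autotune")),
        ]:
            torch._dynamo.reset()  # drop previously compiled artifacts
            out = torch.compile(model, **kwargs)(input_tensor)
            max_abs = (out - reference).abs().max().item()
            print(f"{label}: max abs diff vs eager = {max_abs:.8f}")

If aot_eager matches eager but the Inductor configurations do not, that points at Inductor kernel selection/codegen rather than graph capture.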

Thank you so much for the response — I really appreciate it!

I’ll keep all future updates, findings, and reproductions in this thread to keep everything in one place.

Thanks again for your time and help!