Parity mismatch between PyTorch and TensorRT for the Ultralytics YOLOv11 model

Hi all.
I am trying to convert the Ultralytics YOLOv11 model to TensorRT, and I am seeing a parity mismatch: the output tensors from the PyTorch model do not match the output tensors from the TensorRT model.
Has anybody encountered this issue before?

Experimentation:
I am using torch.compile with the TensorRT backend to convert the PyTorch model. (In the past, I also tried using torch.onnx.export to convert the model to ONNX, and I observed a similar parity mismatch there as well: [issue LINK].)
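For reference, that ONNX path looked roughly like this. This is a simplified sketch rather than the exact script from the linked issue: the opset version and the input/output names are placeholders, and it uses the stock, unmodified model.

import numpy as np
import onnxruntime as ort
import torch
from ultralytics import YOLO

model = YOLO("yolo11n.pt").model.to("cuda").eval()
x = torch.rand(1, 3, 480, 640, device="cuda")

# export with torch.onnx.export (opset and tensor names are placeholders for this sketch)
torch.onnx.export(model, x, "yolo11n.onnx", opset_version=17,
                  input_names=["images"], output_names=["output"])

# run the exported model with onnxruntime (the CUDA provider needs onnxruntime-gpu;
# the CPU provider also works for a parity sanity check)
sess = ort.InferenceSession("yolo11n.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name
ort_out = sess.run(None, {input_name: x.cpu().numpy()})

with torch.no_grad():
    torch_out = model(x)

# compare only the first (detection) output here
print(np.abs(ort_out[0] - torch_out[0].cpu().numpy()).max())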

Minimal code for reproducibility:

Step 1: Modify the Ultralytics codebase so that the model also returns the output tensor of every layer:

def _predict_once(self, x, profile=False, visualize=False, embed=None):
        """
        Perform a forward pass through the network.

        Args:
            x (torch.Tensor): The input tensor to the model.
            profile (bool): Print the computation time of each layer if True.
            visualize (bool): Save the feature maps of the model if True.
            embed (list, optional): Indices of layers to return feature embeddings from.

        Returns:
            (tuple): The final output of the model and a tuple containing the cloned output of every intermediate layer (used for parity comparison).
        """
        y, dt, embeddings = [], [], []  # outputs
        layer_outputs = [None] * (len(self.model) - 1)  # store the output of every layer except the last, for parity comparison
        for layer_idx, m in enumerate(self.model):
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
            if embed and m.i in embed:
                embeddings.append(torch.nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
                if m.i == max(embed):
                    return torch.unbind(torch.cat(embeddings, 1), dim=0)
            if layer_idx < (len(self.model)-1):
                layer_outputs[layer_idx] = x.clone()
        return x, tuple(layer_outputs)
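If you prefer not to edit the installed package, the same modification can also be attached at runtime. A minimal sketch, assuming BaseModel.predict dispatches through self._predict_once (as it does in recent ultralytics versions):

import types
from ultralytics import YOLO

model_torch = YOLO("yolo11n.pt").model
# bind the modified _predict_once defined above to this instance only,
# leaving the installed ultralytics package untouched
model_torch._predict_once = types.MethodType(_predict_once, model_torch)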

Step 2: Compile the model and run the parity test:

import torch
from ultralytics import YOLO
import torch_tensorrt

if __name__ == "__main__":
    torch.cuda.synchronize()
    
    model_torch = YOLO("yolo11n.pt").model.to(device='cuda')
    # random image-like input in [0, 1]
    input_tensor = torch.randint(0, 255, (1, 3, 480, 640), device='cuda').float() / 255.0

    # the nn.Sequential that holds the individual layers (used only to label the printout)
    layers = list(model_torch.children())[0]

    # compile the model with the torch_tensorrt "tensorrt" backend
    trt_model = torch.compile(model_torch, backend="tensorrt")

    with torch.no_grad():
        trt_out = trt_model(input_tensor)
        torch_out = model_torch(input_tensor)

    for i in range(len(torch_out[1])):
        print(f"max_delta[{i}] - [{type(layers[i])}] = {torch.max(torch.abs(trt_out[1][i] - torch_out[1][i]))}")
    print(f"max_delta[{i+1}] - [{type(layers[-1])}] = {torch.max(torch.abs(trt_out[0][0] - torch_out[0][0]))}")

On running this test, the max delta for each layer looks something like this:

As we can see, the max delta for some of the layers is very high. Can anybody help me figure this out?
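For reference, here is a relative-error variant of the same per-layer comparison (reusing the variables from the script above), since an absolute max delta can look large simply because a layer's activations are themselves large:

for i in range(len(torch_out[1])):
    a, b = trt_out[1][i], torch_out[1][i]
    # relative error with a small epsilon to avoid division by zero
    rel_err = (torch.abs(a - b) / (torch.abs(b) + 1e-6)).max()
    print(f"max_rel_delta[{i}] - [{type(layers[i])}] = {rel_err}")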