I exported a PyTorch model (a two-stage Faster R-CNN) to ONNX and ran inference with ONNX Runtime.
The export succeeded and the output shapes are as expected, but the results differ wildly between the CPU and CUDA execution providers: the CPU EP matches eager-mode execution, while the CUDA EP produces very different results. To debug, I exposed every node's output with this:
for node in onnx_model.graph.node:
    for output in node.output:
        onnx_model.graph.output.extend([onnx.ValueInfoProto(name=output)])
and compared the two EPs layer by layer. The outputs start to diverge sharply at torch.expand, which is confusing: expand is a seemingly trivial operation, so I don't see why CPU and CUDA would produce different results there. Can anyone please help me with this issue?
def postprocess(
    self,
    proposals: Tensor,
    objectness: Tensor,
    image_shapes: Tensor,
    features: List[Tensor],
):
    @torch.jit.script
    def _postprocess(
        proposals: Tensor,
        objectness: Tensor,
        image_shapes: Tensor,
        features: List[Tensor],
        min_size: float,
        score_thresh: float,
        nms_thresh: float,
        pre_nms_top_n: int,
        post_nms_top_n: int,
        proposal_dim: int,
        num_anchors_per_location: int,
    ):
        ....
        # top_n_idx: Tensor of shape [batch_size, pre_nms_top_n]
        top_n_idx = objectness.topk(pre_nms_top_n, dim=1)[1]  # same output on CPU and CUDA
        # torch.gather is equivalent to the following advanced indexing:
        #   image_range = torch.arange(num_images)
        #   batch_idx = image_range.unsqueeze(1)
        #   objectness = objectness[batch_idx, top_n_idx]
        #   levels = levels[batch_idx, top_n_idx]
        #   proposals = proposals[batch_idx, top_n_idx]
        # objectness: Tensor of shape [batch_size, pre_nms_top_n]
        objectness = torch.gather(objectness, 1, top_n_idx)  # same output on CPU and CUDA
        # levels: Tensor of shape [batch_size, pre_nms_top_n]
        levels = torch.gather(levels, 1, top_n_idx)  # same output on CPU and CUDA
        # proposals: Tensor of shape [batch_size, pre_nms_top_n, 4]
        # to use gather, unsqueeze the index and expand its last dim
        top_n_idx = top_n_idx.unsqueeze(2).expand(-1, -1, proposal_dim)  # very different CPU vs CUDA; divergence starts at expand
        proposals = torch.gather(proposals, 1, top_n_idx)  # also very different
        ....
        return proposals, objectness
    return _postprocess(
        proposals,
        objectness,
        image_shapes,
        features,
        self.min_size,
        self.score_thresh,
        self.nms_thresh,
        self.pre_nms_top_n(),
        self.post_nms_top_n(),
        self.box_coder.proposal_dim,
        self.anchor_generator.num_anchors_per_location()[0],
    )
Thank you in advance for your time and help!