Unroll for loops in forward pass for GPU

I have an input tensor of shape (batch_size, X, Y) and need to pass it through the forward step of my custom model.
At a high level, the forward step does the following:
I loop over each batch and send the inner tensor of shape (X, Y) to another model, which gives me something of shape (X, Z).
Then I take the average over X and assign the result to each batch, giving a final tensor of shape (batch_size, Z).

I can do it with a for-loop, but I think it might be inefficient to run on a GPU (also, I am getting a CUDA error right now). A naive example to better explain the problem:

def forward(self, in_tensor):
    """in_tensor is something like
    in_tensor = torch.tensor([[[1., 2., 3.],
                               [4., 5., 6.]],

                              [[1., 2., 3.],
                               [4., 5., 6.]]], dtype=torch.float)  # shape (2, 2, 3)
    """
    res = []
    for batch, el in enumerate(in_tensor):
        # el has shape (2, 3)
        tmp_res = other_model(el)  # other_model returns 10 features for each input row
        # tmp_res has shape (2, 10)
        res.append(torch.mean(tmp_res, dim=0))
    res = torch.stack(res, dim=0)  # shape (batch_size, 10)
    return res

I am wondering if there is a way to avoid the for-loop.
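If other_model only operates on the last dimension (as nn.Linear and most pointwise layers do), you can feed the whole (batch_size, X, Y) tensor in at once and average over dim=1, with no Python loop. A minimal sketch, using a hypothetical nn.Linear as a stand-in for other_model (the real model isn't shown in the question):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for other_model: maps the last dimension Y=3 to
# Z=10 features and broadcasts over all leading dimensions.
other_model = nn.Linear(3, 10)

in_tensor = torch.tensor([[[1., 2., 3.],
                           [4., 5., 6.]],

                          [[1., 2., 3.],
                           [4., 5., 6.]]], dtype=torch.float)  # shape (2, 2, 3)

# Loop-free: apply the model to the whole batch at once, then average over X.
res = other_model(in_tensor).mean(dim=1)  # shape (2, 10)

# Equivalent loop version, for comparison with the original forward():
loop_res = torch.stack([other_model(el).mean(dim=0) for el in in_tensor])
assert torch.allclose(res, loop_res, atol=1e-6)
```

If other_model insists on 2-D input, the same trick works by flattening first: `other_model(in_tensor.reshape(-1, Y)).reshape(batch_size, X, -1).mean(dim=1)`.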

Did you ever figure out a way? I was looking into something like this and found TorchScript. Maybe putting it through the JIT will let it unroll the for loop, but I have not tested it at all and am just proposing a solution because I am in the same spot.

If you found a solution without using JIT, I would very much like to know!
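One way to try the TorchScript idea is to script a small stand-in function. Note that scripting compiles the loop out of the Python interpreter but does not vectorize it; the per-iteration model call still runs X times. A sketch, with other_model replaced by a simple per-batch mean since the real model isn't shown:

```python
import torch

@torch.jit.script
def averaged(in_tensor: torch.Tensor) -> torch.Tensor:
    # TorchScript removes Python-interpreter overhead from this loop,
    # but each iteration is still executed sequentially.
    res = []
    for i in range(in_tensor.size(0)):
        # Stand-in for other_model(...): average the inner (X, Y) tensor over X.
        res.append(in_tensor[i].mean(dim=0))
    return torch.stack(res, dim=0)

out = averaged(torch.randn(4, 2, 3))  # shape (4, 3)
```

For real speedups on GPU, batching the model call as in the reshape approach is usually the better bet; scripting mainly helps when the loop body itself is cheap Python.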