Batching a custom layer/module without loops?

For reasons I don’t want to delve too deeply into, I’m running inference using the OpenCV DNN runtime on a PyTorch model that’s been exported to ONNX. I have no issues with the model in PyTorch or ONNX, but when loading it into OpenCV the MatMul node fails to load. As far as I understand, this happens because the OpenCV DNN implementation of the MatMul op doesn’t support batch dimensions.

Obviously the ideal solution here is raising an issue on the OpenCV repo, but in the short term I’m looking for a way around this.

By squeezing/unsqueezing the batch dimension for this one operation I can get the network working for inference, but only when the input batch size is one. Is there a way I can wrap the 2D matmul in some kind of batching op that applies the matmul to each batch element individually?
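For reference, here is a minimal sketch of the squeeze/unsqueeze workaround I mean. The module name `SqueezedMatMul` and the tensor shapes are made up for illustration; my real layer differs, but the idea is the same.

```python
import torch
import torch.nn as nn


class SqueezedMatMul(nn.Module):
    """2D matmul with the batch dimension squeezed away.

    Hypothetical example module: it only produces a 2D MatMul
    when the batch size is exactly one.
    """

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Assumed shapes: a is (1, N, K), b is (1, K, M).
        a2d = a.squeeze(0)             # (N, K)
        b2d = b.squeeze(0)             # (K, M)
        out = torch.matmul(a2d, b2d)   # plain 2D matmul -> (N, M)
        return out.unsqueeze(0)        # restore batch dim -> (1, N, M)


a = torch.randn(1, 3, 4)
b = torch.randn(1, 4, 5)
out = SqueezedMatMul()(a, b)
print(out.shape)
```

With a batch size above one, `squeeze(0)` leaves the tensors unchanged, so the matmul is batched again and the output gains an extra leading dimension, which is exactly what I am trying to avoid.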

I know I could do this with a simple for loop, but I think that runs into issues with variable batch sizes once the JIT trace unrolls it.
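To show the concern concretely, here is a rough sketch of the loop version. `LoopedMatMul` and the shapes are again hypothetical. My understanding is that `torch.jit.trace` records the loop with the iteration count seen in the example input, so the traced graph is fixed to that batch size.

```python
import torch
import torch.nn as nn


class LoopedMatMul(nn.Module):
    """Applies a 2D matmul to each batch element in a Python loop."""

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Assumed shapes: a is (B, N, K), b is (B, K, M).
        outs = [torch.matmul(a[i], b[i]) for i in range(a.shape[0])]
        return torch.stack(outs, dim=0)  # (B, N, M)


module = LoopedMatMul()

# Trace with batch size 2; the loop is unrolled into two matmuls.
traced = torch.jit.trace(module, (torch.randn(2, 3, 4), torch.randn(2, 4, 5)))

# Run both versions on a batch of 3 and compare the output shapes.
a = torch.randn(3, 3, 4)
b = torch.randn(3, 4, 5)
eager_out = module(a, b)
traced_out = traced(a, b)
print(eager_out.shape, traced_out.shape)
```

In eager mode all three batch elements are processed, while the traced module only handles as many elements as it saw during tracing. Tracing may also emit a warning about the batch size being treated as a constant.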