torch.tile is my bottleneck


I’m writing a custom operation that runs one example through a layer, tiles the result to the full batch size, and then computes only sparse deltas and applies them. In practice, the tiling step (torch.tile(X_ref, (X.shape[0], 1, 1))) is the bottleneck and can take up to 90% of the operation's run time on a CPU. That said, it’s not slow per se; it’s just that the other operations are very fast. Is there a faster way to tile along a single dimension than torch.tile? I’ve already tried X_ref.expand(X.shape[0], -1, -1).contiguous(), but it takes roughly the same amount of time.
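For reference, here is a minimal sketch of the two variants I'm comparing. The shapes, the `time_it` helper, and the function names are assumptions made up for illustration, not my actual code; I'm also assuming X_ref has a leading dimension of 1.

```python
import time
import torch

# Assumed shapes for illustration only; the real ones will differ.
batch_size, seq_len, dim = 256, 128, 64
X = torch.randn(batch_size, seq_len, dim)
X_ref = torch.randn(1, seq_len, dim)  # layer output for the single reference example

def tile_version(X_ref, X):
    # Materializes X.shape[0] copies of X_ref along dim 0.
    return torch.tile(X_ref, (X.shape[0], 1, 1))

def expand_version(X_ref, X):
    # expand() only creates a stride-0 view; contiguous() performs the actual copy.
    return X_ref.expand(X.shape[0], -1, -1).contiguous()

def time_it(fn, n_iter=20):
    # Hypothetical helper: average wall-clock time per call after one warm-up run.
    fn(X_ref, X)
    start = time.perf_counter()
    for _ in range(n_iter):
        fn(X_ref, X)
    return (time.perf_counter() - start) / n_iter

tiled = tile_version(X_ref, X)
expanded = expand_version(X_ref, X)

print(f"tile:   {time_it(tile_version) * 1e3:.3f} ms")
print(f"expand: {time_it(expand_version) * 1e3:.3f} ms")
```

Both variants end up writing a full (batch, seq, dim) tensor to memory, which may be why they take about the same time.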

Thanks for your time.