Fastest implementation for indexing into multi-dimensional tensors

Hello,
I have an application where I keep a 4-dimensional tensor in GPU memory and use it as a lookup table. It stores pre-computed values so that I don't have to recompute them on every forward pass of a GCNN layer, which should save inference time.
From several experiments I’ve noticed that the faster the GPU, the smaller the savings. The bottleneck is actually the indexing into the lookup table: it is 10 times slower than the other computations I do after the indexing. It seems that on high-end GPUs such as the Nvidia RTX8000 and Nvidia RTX A6000, computing these values from scratch takes about as long as simply fetching the pre-computed values stored in the tensor.
My question is: does PyTorch offer faster ways of indexing into multi-dimensional tensors than just doing tensor[list_indexes1, list_indexes2]?
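
To make the question concrete, here is a minimal sketch of the kind of indexing meant above, next to one possible alternative: merging the two indexed dimensions and doing a single index_select with a linear index. The table shape, the number of lookups, and all variable names are made up for illustration and are not taken from the real application. Whether the alternative is actually faster is not something this sketch shows; that would have to be measured on the target GPU.

```python
import torch

# Falls back to CPU so the sketch also runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical 4-D lookup table; the shape is an assumption for illustration.
table = torch.randn(8, 8, 16, 32, device=device)

# Hypothetical index lists for the first two dimensions (int64 tensors).
idx1 = torch.randint(0, table.shape[0], (1000,), device=device)
idx2 = torch.randint(0, table.shape[1], (1000,), device=device)

# Baseline: advanced indexing on the first two dims -> shape (1000, 16, 32).
baseline = table[idx1, idx2]

# Alternative: merge the two indexed dims into one (a view, no copy, because
# the table is contiguous), then gather rows with a single linear index.
flat_table = table.view(table.shape[0] * table.shape[1], *table.shape[2:])
flat_idx = idx1 * table.shape[1] + idx2
alternative = flat_table.index_select(0, flat_idx)

# Both approaches return exactly the same values.
print(torch.equal(baseline, alternative))
```

Since CUDA kernels are launched asynchronously, any timing comparison of the two variants needs a torch.cuda.synchronize() before reading the clock (or a tool such as torch.utils.benchmark.Timer, which handles this), otherwise the measured times can be misleading.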