Combinations of a set of tensors and GPU optimization

Hello everyone,

I am a PyTorch beginner and I would like to compute the pairwise combinations of a set of 1D tensors of different lengths, then concatenate the results into a single output.
I wrote the following, which gives me the desired output:

import torch

# Input example
sub_data_idx = torch.tensor([2, 5, 5, 0, 4, 1, 4, 5, 3, 2, 1, 0, 3, 3, 0, 0, 2, 1]).cuda()
data_idx = torch.tensor([0, 1, 2, 3, 4, 5]).cuda()

# Dummy row so the first torch.cat below has something to concatenate onto
data_combinations_pair = torch.zeros(1, 2, dtype=torch.long).cuda()

# Loop over the data indices; each selects a 1D tensor of positions with a different length
for i in data_idx:
    sub_data_item = (sub_data_idx == i.item()).nonzero().flatten()
    data_combinations = torch.combinations(sub_data_item)
    data_combinations_pair = torch.cat([data_combinations_pair, data_combinations])

# Drop the dummy initialization row
data_combinations_pair = data_combinations_pair[1:]
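As a side note, a small cleanup of the loop above (a sketch, same logic, shown on CPU): collecting each group's result in a Python list and calling torch.cat once avoids both the dummy row and the repeated concatenations, which each allocate and copy the whole accumulated tensor.

```python
import torch

# Same inputs as above, shown on CPU; append .cuda() as in the original code to run on GPU.
sub_data_idx = torch.tensor([2, 5, 5, 0, 4, 1, 4, 5, 3, 2, 1, 0, 3, 3, 0, 0, 2, 1])
data_idx = torch.tensor([0, 1, 2, 3, 4, 5])

# Collect each group's pair combinations in a Python list and concatenate once:
# no dummy row to create and slice off, and only a single torch.cat call.
parts = []
for i in data_idx:
    sub_data_item = (sub_data_idx == i).nonzero().flatten()
    parts.append(torch.combinations(sub_data_item))
data_combinations_pair = torch.cat(parts)  # shape (num_pairs, 2)
```

This still loops in Python, but it removes the quadratic copying from the repeated torch.cat calls.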

My problem: the "for loop" is time consuming on the GPU, and I would like to avoid it.

Is there a PyTorch function that can do this?
Does anyone have an idea how to avoid this "for loop"?

I thought about using these PyTorch functions inside a Numba kernel, but I also do not know whether torch.combinations is a device function or a kernel (host-launched) function.
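For reference, here is one loop-free formulation I have seen suggested for this kind of grouped-pairs problem (a sketch, not a definitive answer): build an n x n boolean mask marking index pairs i < j that share the same group label, then read the pairs off with nonzero(). Note it materializes an n x n mask, so memory grows quadratically with the length of sub_data_idx.

```python
import torch

# Same example input as above (CPU here; the identical code runs on a CUDA tensor).
sub_data_idx = torch.tensor([2, 5, 5, 0, 4, 1, 4, 5, 3, 2, 1, 0, 3, 3, 0, 0, 2, 1])

n = sub_data_idx.numel()
# same_group[i, j] is True when positions i and j carry the same label...
same_group = sub_data_idx.unsqueeze(0) == sub_data_idx.unsqueeze(1)
# ...restricted to the strict upper triangle so each kept pair (i, j) has i < j
upper = torch.triu(torch.ones(n, n, dtype=torch.bool, device=sub_data_idx.device), diagonal=1)
pairs = (same_group & upper).nonzero()  # all intra-group index pairs, ordered by i

# Optional: stable-sort by group label to reproduce the loop version's exact row order
order = torch.sort(sub_data_idx[pairs[:, 0]], stable=True).indices
pairs = pairs[order]
```

Without the final sort, the same set of pairs comes out ordered by first index rather than grouped by label.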

Thank you in advance for your help, and thanks to the PyTorch developers.

Best regards,