Speeding up tensor concatenation


I need to concatenate a long list of small tensors. Each small tensor is a slice of a given (quite simple) constant matrix. Here is the code:

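import torch

# device was not defined in the original snippet; assumed to be chosen like this
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_node, counter = 0, 0
batch_size, n_days = (1000, 10)
n_interactions_in = torch.randint(low=100, high=200, size=(batch_size, n_days), dtype=torch.long)
max_interactions = n_interactions_in.max().item()
# Row d of delay_table holds the constant value n_days - 1 - d
delay_table = torch.arange(n_days, device=device, dtype=torch.float).expand([max_interactions, n_days]).t().contiguous()
delay_table = n_days - delay_table - 1
edge_delay_buf = []
for b in range(batch_size):
    # One slice per day; its length is n_interactions_in[b, d]
    delay_vec = [delay_table[d, :n_interactions_in[b, d]] for d in range(n_days)]
    edge_delay_buf.extend(delay_vec)
res = torch.cat(edge_delay_buf)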

This takes a lot of time. Is there a way to efficiently parallelize the creation of each element in edge_delay_buf?
I have tried multiple variants, such as replacing the for loop with a list comprehension that produces a list of lists, then flattening that list and applying torch.cat to the flattened list. However, this didn't improve things much. For some reason the slicing operation itself takes too long.
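
A minimal sketch of what such a list-of-lists variant might look like. The small sizes and the names `counts` and `counts_list` are assumptions for illustration, not the exact code that was tried. It also converts the counts to plain Python ints once with `tolist()`, so the slice bounds are not tensor elements converted one at a time inside the loop.

```python
import itertools

import torch

n_days, batch_size = 10, 8  # illustrative sizes, smaller than in the question
counts = torch.randint(low=100, high=200, size=(batch_size, n_days), dtype=torch.long)
max_interactions = int(counts.max())

# Row d of delay_table holds the constant n_days - 1 - d
delay_table = (n_days - 1 - torch.arange(n_days, dtype=torch.float))
delay_table = delay_table.unsqueeze(1).expand(n_days, max_interactions).contiguous()

# Convert once to nested Python ints (hypothetical helper name)
counts_list = counts.tolist()

# List of lists: one inner list of slices per batch element
nested = [
    [delay_table[d, : counts_list[b][d]] for d in range(n_days)]
    for b in range(batch_size)
]

# Flatten and concatenate in a single call
res = torch.cat(list(itertools.chain.from_iterable(nested)))
```

This still creates batch_size * n_days small views, so the Python-level loop overhead remains.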

Is there a way to make the slicing faster? Is there a way to make the loop more efficient / parallel?
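
One possible direction, sketched under the assumption that every row of delay_table is constant (row d holds n_days - 1 - d), as in the code above: each slice is then just one value repeated n_interactions_in[b, d] times, so the whole result could be built with a single `torch.repeat_interleave` call instead of a loop. The function name `edge_delays_vectorized` is hypothetical.

```python
import torch


def edge_delays_vectorized(n_interactions_in: torch.Tensor, n_days: int) -> torch.Tensor:
    """Build the concatenated delays without a Python loop.

    Assumes the slice for (b, d) is the constant n_days - 1 - d repeated
    n_interactions_in[b, d] times, which holds for the delay_table in the question.
    """
    batch_size = n_interactions_in.shape[0]
    device = n_interactions_in.device
    # Delay value for each day: n_days - 1, n_days - 2, ..., 0
    per_day = n_days - 1 - torch.arange(n_days, dtype=torch.float, device=device)
    # Tile the per-day values once per batch element, in (b, d) row-major order
    values = per_day.repeat(batch_size)
    # Repeat each value by its count; the order matches the nested loop over b, then d
    return torch.repeat_interleave(values, n_interactions_in.reshape(-1))


counts = torch.randint(low=100, high=200, size=(1000, 10), dtype=torch.long)
res = edge_delays_vectorized(counts, n_days=10)
```

If the real matrix is not constant along each row, this shortcut no longer applies and a mask over a (batch_size, n_days, max_interactions) view would be needed instead.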