Torch filter multidimensional tensor by start and end values

Adib_Mosharrof · October 20, 2022, 7:57pm

I have a list of sentences, and for each one I want to extract the tokens that lie between a start value and an end value.
If the start or end value does not exist in a sentence, I want that row to contain padding only.
The sentences are already tokenized and padded with 0 to a fixed length.

I found a way to do this with for loops, but it is extremely slow, so I would like to
know the best way to solve this.

import torch

start_value, end_value = 4, 9

data = torch.tensor([
    [3, 4, 7, 8, 9, 2, 0, 0, 0, 0],
    [1, 5, 3, 4, 7, 2, 8, 9, 10, 0],
    [3, 4, 7, 8, 10, 0, 0, 0, 0, 0],  # does not contain end value
    [3, 7, 5, 9, 2, 0, 0, 0, 0, 0],   # does not contain start value
])

# expected output
[
    [7, 8, 0, 0, 0, 0, 0, 0, 0, 0],
    [7, 2, 8, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

AbdulsalamBande · October 20, 2022, 9:46pm

There are two possible solutions:

1. Use the for loop once and store the processed data in a file, so that the data does not have to be reprocessed.
2. Use a PyTorch DataLoader to process your inputs in batches, which can save a lot of time. Check PyTorch DataLoaders.

Bonus: PyTorch has utility functions like torch.nn.utils.rnn.pad_sequence and torch.nn.utils.rnn.pack_padded_sequence.
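
For reference, a minimal sketch of what pad_sequence does, assuming a list of variable-length 1-D tensors (the `sequences` values here are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Illustrative variable-length token sequences (not from the original post).
sequences = [
    torch.tensor([7, 8]),
    torch.tensor([7, 2, 8]),
    torch.tensor([5]),
]

# batch_first=True gives shape (batch, max_len); shorter rows are filled with 0.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded)
# tensor([[7, 8, 0],
#         [7, 2, 8],
#         [5, 0, 0]])
```

Note that this pads only to the longest sequence in the list, not to a fixed length chosen in advance.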

Adib_Mosharrof · October 20, 2022, 9:59pm

I am doing this calculation inside the compute loss function, where the data is the prediction of the model, so I don't think I can use DataLoaders here. In my for loop solution I used pad_sequence.
I was wondering if there is a way to do this with tensor operations instead of a for loop.
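
One possible loop-free approach, given here as a sketch under assumptions rather than a confirmed answer from the thread: find the first start column and the first end column after it with masked minimums, then shift each row left with gather. The helper name `extract_between` and the `pad_value` parameter are hypothetical. The sketch assumes a 2-D integer tensor, and that the span runs from the first occurrence of the start value to the first occurrence of the end value that follows it.

```python
import torch


def extract_between(data, start_value, end_value, pad_value=0):
    """Hypothetical helper: keep the tokens strictly between the first
    start_value and the first end_value after it, left-aligned and padded."""
    n_rows, n_cols = data.shape
    positions = torch.arange(n_cols, device=data.device).repeat(n_rows, 1)
    no_match = torch.full_like(positions, n_cols)  # sentinel meaning "not found"

    # Column of the first start_value in each row (n_cols if absent).
    start_idx = torch.where(data == start_value, positions, no_match).min(dim=1).values

    # Column of the first end_value located after the start (n_cols if absent).
    after_start = positions > start_idx.unsqueeze(1)
    is_end = (data == end_value) & after_start
    end_idx = torch.where(is_end, positions, no_match).min(dim=1).values

    # A missing start also leaves end_idx at the sentinel,
    # so this single check covers both "not found" cases.
    valid = end_idx < n_cols
    length = (end_idx - start_idx - 1).clamp(min=0)

    # Shift each row left so the token after the start lands in column 0.
    src = (positions + start_idx.unsqueeze(1) + 1).clamp(max=n_cols - 1)
    shifted = data.gather(1, src)

    # Keep only the first `length` columns of valid rows; pad everything else.
    keep = (positions < length.unsqueeze(1)) & valid.unsqueeze(1)
    return torch.where(keep, shifted, torch.full_like(data, pad_value))


data = torch.tensor([
    [3, 4, 7, 8, 9, 2, 0, 0, 0, 0],
    [1, 5, 3, 4, 7, 2, 8, 9, 10, 0],
    [3, 4, 7, 8, 10, 0, 0, 0, 0, 0],  # does not contain end value
    [3, 7, 5, 9, 2, 0, 0, 0, 0, 0],   # does not contain start value
])

result = extract_between(data, start_value=4, end_value=9)
print(result)
```

On the example data this should reproduce the expected output from the first post. Everything is done with batched tensor operations, so it runs on whatever device `data` lives on without a Python loop over rows.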