I have a three-dimensional (x, y, z) NumPy array of audio clips. x is the number of data points, y is the length of each utterance, and z is the number of features, which stays constant. Each utterance is a group of frames, and for this problem I have to randomly choose a frame together with its corresponding label. Each frame must be chosen with some context, which means I will be indexing my dataset to randomly select a (context + frame) x z subset of data from any utterance to train the model. I need to pad zeros at the start and end of each data point x to ensure that frames at the edges have context on both sides.
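To make the padding step concrete, here is a minimal sketch, assuming each utterance is a 2-D array of shape (frames, z) and that the same number of context frames is taken on each side. The helper name pad_utterance and the value context=2 are made up for illustration.

```python
import numpy as np

def pad_utterance(utt, context):
    """Zero-pad a (frames, features) array with `context` rows on each side."""
    # Pad only along the frame axis; the feature axis is left untouched.
    return np.pad(utt, ((context, context), (0, 0)), mode="constant")

context = 2                      # assumed context size, for illustration only
utt = np.ones((5, 3))            # toy utterance: 5 frames, 3 features
padded = pad_utterance(utt, context)

# Original frame i now sits at row i + context of the padded array,
# so its window (context + frame + context) starts at row i.
i = 0
window = padded[i : i + 2 * context + 1]
```

With this layout the first frame of the utterance gets two all-zero rows as its left context.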

Now, y (the utterance length) differs between data points, so I cannot convert the array to a tensor directly. I need to use np.vstack to stack the data points. The problem I am facing is in writing the __getitem__ method of my torch.utils.data.Dataset, where I must index only the original elements of y and ignore the padded elements in the new stacked matrix. Are there any hints that you can provide for this?
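One possible direction, as a rough sketch rather than a tested solution: keep the cumulative count of real frames per utterance, and use it to map a flat frame index onto a row of the stacked, padded matrix. The class name FrameDataset and its arguments are hypothetical, and it is written in plain NumPy to stay self-contained. In practice it would presumably subclass torch.utils.data.Dataset and convert the returned window to a tensor.

```python
import numpy as np

class FrameDataset:
    """Hypothetical map-style dataset over real (unpadded) frames only."""

    def __init__(self, utterances, labels, context):
        self.context = context
        lengths = [len(u) for u in utterances]
        # Zero-pad every utterance on both sides, then stack into one matrix.
        padded = [np.pad(u, ((context, context), (0, 0))) for u in utterances]
        self.data = np.vstack(padded)
        self.labels = np.concatenate(labels)
        # frame_starts[u] = number of real frames before utterance u.
        self.frame_starts = np.concatenate(([0], np.cumsum(lengths)))

    def __len__(self):
        # Only real frames are counted, never padding rows.
        return int(self.frame_starts[-1])

    def __getitem__(self, idx):
        # Find the utterance that owns real frame `idx`.
        u = int(np.searchsorted(self.frame_starts, idx, side="right")) - 1
        # Each earlier utterance added 2 * context padding rows, so the
        # window around frame `idx` starts at this row of the stacked matrix.
        start = idx + 2 * self.context * u
        window = self.data[start : start + 2 * self.context + 1]
        return window, self.labels[idx]

# Toy data: two utterances of different lengths, 2 features each.
utts = [np.arange(6.0).reshape(3, 2) + 1, np.arange(8.0).reshape(4, 2) + 100]
labs = [np.array([0, 1, 2]), np.array([3, 4, 5, 6])]
ds = FrameDataset(utts, labs, context=1)
window, label = ds[3]   # first frame of the second utterance
```

Because __len__ counts only real frames, a random sampler over range(len(ds)) never lands on a padding row. Padding only ever shows up as context inside a window.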