Returning data of different lengths

Debajyoti_Sengupta · December 29, 2023, 10:42am

Hello all,

I am trying to figure out a solution for my use case where I have to return three objects from my dataset that are of different lengths. Here’s a short snippet that explains what I have now.

from pathlib import Path

import numpy as np
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(
        self,
        data_dir: list[Path],
    ) -> None:
        """
        Args
        ----
        data_dir: list[Path]
            List of paths to data npy files.
        """
        self.data_dir = data_dir
        # load_data is a helper defined elsewhere (not shown).
        # Keep the arrays on self so __getitem__ can index them.
        self.data, self.data2, self.labels = load_data(self.data_dir)
        # here, data and labels are of the same length
        # data2 is of a different length.
        self.num_len = min(len(self.data), len(self.data2))

    def __len__(self) -> int:
        return self.num_len

    def __getitem__(self, idx: int) -> tuple:
        return self.data[idx], self.labels[idx], self.data2[idx]

My model requires self.data during training; self.data and self.data2 during validation; and self.data and self.labels during prediction.

To keep things running, I take the minimum of the two lengths and return that as the length of the dataset, but I think that means a lot of the data remains unused.
I was wondering if there is a smart, more efficient way to deal with this?
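
To make the requirement concrete, here is a minimal sketch of one possible direction, not a tested solution: a dataset that takes a stage argument and reports the length of the arrays that stage actually needs. The StageDataset name, the stage argument, and the toy arrays are all hypothetical. The sketch also assumes that during validation data and data2 do not have to be paired index by index, so the shorter array can simply wrap around. It is written as a plain class to stay self-contained; in real code it would subclass torch.utils.data.Dataset.

```python
import numpy as np


class StageDataset:
    """Hypothetical dataset whose length and items depend on the stage."""

    def __init__(self, data, data2, labels, stage="train"):
        if stage not in ("train", "val", "predict"):
            raise ValueError(f"unknown stage: {stage}")
        self.data, self.data2, self.labels = data, data2, labels
        self.stage = stage

    def __len__(self):
        if self.stage == "val":
            # Assumption: cover the longer array and wrap the shorter one.
            return max(len(self.data), len(self.data2))
        # train and predict only index data (labels has the same length).
        return len(self.data)

    def __getitem__(self, idx):
        if self.stage == "train":
            return self.data[idx]
        if self.stage == "val":
            # The modulo wraps whichever array is shorter.
            return (
                self.data[idx % len(self.data)],
                self.data2[idx % len(self.data2)],
            )
        return self.data[idx], self.labels[idx]


# Toy arrays standing in for the real data: 10 samples vs. 4 samples.
data = np.arange(10, dtype=np.float32).reshape(10, 1)
labels = np.arange(10)
data2 = np.arange(4, dtype=np.float32).reshape(4, 1)

train_ds = StageDataset(data, data2, labels, stage="train")
val_ds = StageDataset(data, data2, labels, stage="val")
pred_ds = StageDataset(data, data2, labels, stage="predict")
```

With one instance per stage, no stage is truncated to the shorter array. Whether wrapping the shorter array in validation is acceptable depends on how the model uses data2; if the two arrays must stay aligned, this sketch does not apply.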