Hello all,
I am trying to find a solution for a use case where my dataset has to return three objects of different lengths. Here’s a short snippet that shows what I have now:
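from pathlib import Path

import numpy as np
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(
        self,
        data_dir: list[Path],
    ) -> None:
        """
        Args
        ----
        data_dir: list[Path]
            List of paths to data npy files.
        """
        self.data_dir = data_dir
        # load_data is defined elsewhere; keep the arrays on self
        # so that __getitem__ can index them
        self.data, self.data2, self.labels = load_data(self.data_dir)
        # here, data and labels are of the same length
        # data2 is of a different length.
        self.num_len = min(len(self.data), len(self.data2))

    def __len__(self) -> int:
        return self.num_len

    def __getitem__(self, idx: int) -> tuple:
        # returns a (data, label, data2) triple for one index
        return self.data[idx], self.labels[idx], self.data2[idx]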
My model requires self.data during the training stage; self.data and self.data2 during validation; and self.data and self.labels during the prediction stage.
To keep things running, I take the minimum of the two lengths and return that as the length of the dataset, but I think that means a lot of the data is never used.
I was wondering if there is a smarter, more efficient way to deal with this?
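One direction I can think of, shown here only as a rough sketch, is to build one dataset per stage so that each stage is truncated only by the arrays it actually uses. StageDataset is a hypothetical name, the toy lists stand in for whatever load_data returns, and I am assuming that data and data2 are paired by index during validation.

```python
from typing import Sequence

try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch also runs without PyTorch
    Dataset = object


class StageDataset(Dataset):
    """Hypothetical helper: serves only the arrays that one stage needs."""

    def __init__(self, *arrays: Sequence) -> None:
        if not arrays:
            raise ValueError("at least one array is required")
        self.arrays = arrays
        # Truncate only to the shortest array used by *this* stage.
        self.num_len = min(len(a) for a in arrays)

    def __len__(self) -> int:
        return self.num_len

    def __getitem__(self, idx: int) -> tuple:
        if not 0 <= idx < self.num_len:
            raise IndexError(idx)
        # One item per array, in the order the arrays were passed in.
        return tuple(a[idx] for a in self.arrays)


# Toy stand-ins for the output of load_data (lengths chosen arbitrarily).
data = list(range(10))
labels = [x % 2 for x in data]
data2 = list(range(100, 104))

train_ds = StageDataset(data)            # not limited by data2
val_ds = StageDataset(data, data2)       # limited by the shorter of the two
predict_ds = StageDataset(data, labels)  # not limited by data2
```

With this layout only the validation dataset is cut to the shorter length; training and prediction see every sample in data. Each dataset can then be wrapped in its own DataLoader. Whether truncating during validation is acceptable depends on how data and data2 relate to each other, which the sketch does not try to answer.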