Custom Dataset with a __getitem__ method that requires two indices as input

Hi all,

my data is stored in a three-dimensional tensor (number of samples, length of timeseries, feature dimension).
Concatenating the different samples into one long timeseries is not possible in my case for methodological reasons. Hence, I need a custom __getitem__ method that accepts two indices: one to choose the sample and one to choose the index within that sample. What exactly do I have to change in addition? Do I have to write a custom collate_fn as well? Which adjustments would be necessary there?
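For illustration, the interface I have in mind would look roughly like this (just a sketch; data is a placeholder for my tensor):

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __getitem__(self, idx):
        # idx would have to be a pair (sample_idx, timeseries_idx),
        # but the default sampler only ever yields single integers
        sample_idx, timeseries_idx = idx
        return self.data[sample_idx][timeseries_idx]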
I have just realized that this would require editing the _MapDatasetFetcher(_BaseDatasetFetcher) class, since there is a function call:
data = [self.dataset[idx] for idx in possibly_batched_index]
I.e., apparently the dataset’s __getitem__ method is not supposed to take two indices. So do I even have a chance to implement my approach with two indices in __getitem__?

https://github.com/pytorch/pytorch/blob/master/torch/utils/data/_utils/fetch.py

Hi Raphi!

I don’t know of a way to give the dataset two indices, but I can think of another solution to your problem.

When you initialize your dataset, you can build a mapping from a one-dimensional index to your two-dimensional index. I wrote some code that will hopefully help you:

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data_file):
        self.data_file = data_file
        self.index_map = {}
        index = 0
        # Build a flat index -> (sample_index, timeseries_index) mapping;
        # this assumes each element carries its own 'index' field
        for sample in data_file:  # First dimension
            sample_index = sample['index']
            for timeseries in sample:  # Second dimension
                timeseries_index = timeseries['index']
                self.index_map[index] = (sample_index, timeseries_index)
                index += 1

    def __getitem__(self, idx):
        sample_index, timeseries_index = self.index_map[idx]
        # Use the two indices to get the desired data
        ...

Hi,
The fact is that you will have a fixed number of samples. You can think of a sample as an NN input. So if your data is (N_samples, length) and you need two indices, you can just write the dataset as if you had N_samples x length samples and create a mapping (N_samples * length) --> (N_samples, length), as sketched below.
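As a sketch (assuming all samples have the same length), that mapping can be as simple as divmod:

def flat_to_pair(flat_idx, length):
    # map a flat index in [0, N_samples * length) to (sample_index, time_index)
    return divmod(flat_idx, length)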

The Dataset class is flexible: you just return how many elements you have and it will iterate over that amount. If your data is more complex, you just need to think about how to code it. I would say that it doesn’t accept several indices because it follows the maxim “1 input - 1 index”.

Hi, thank you both for your answers.
@Juan what I have not mentioned in my question is that my timeseries have different lengths, so your advice wouldn’t work unless I pad them.

@Oli your suggestion is very interesting! However, when testing your code I get an error:
IndexError: too many indices for tensor of dimension 2

Which python version do you use? (I use python 3.6)

EDIT: I have adjusted the code like this, so that it also works on my system:

index_map = {}
index = 0
for sample_index, sample in enumerate(data_file):  # First dimension
    for timeseries_index, timeseries in enumerate(sample):  # Second dimension
        index_map[index] = (sample_index, timeseries_index)
        index += 1
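For completeness, this is roughly how the whole dataset looks on my side now (a sketch, assuming data_file is a list of 2-D tensors of shape (length_i, feature_dim), which may differ in length):

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data_file):
        self.data_file = data_file
        self.index_map = {}
        index = 0
        for sample_index, sample in enumerate(data_file):  # First dimension
            for timeseries_index, timeseries in enumerate(sample):  # Second dimension
                self.index_map[index] = (sample_index, timeseries_index)
                index += 1

    def __getitem__(self, idx):
        sample_index, timeseries_index = self.index_map[idx]
        # one timestep of one sample: a 1-D tensor of size feature_dim
        return self.data_file[sample_index][timeseries_index]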

BTW you can return lists of tensors instead of stacked tensors if that helps 🙂
The DataLoader can return nested Python structures as well.
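For example, a collate_fn along these lines (just a sketch) keeps the samples in a plain list instead of stacking them, so tensors of different sizes can share a batch:

from torch.utils.data import DataLoader

def list_collate(batch):
    # batch already arrives as a list of samples; return it as-is
    # instead of letting the default collate stack the tensors
    return list(batch)

# hypothetical usage:
# loader = DataLoader(dataset, batch_size=4, collate_fn=list_collate)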

My code wasn’t meant to be used as actual code, just to show you my idea. You need to adapt it to your situation 🙂

Great that you did it 😃

Thanks, that’s useful info, too. Great!

Haha ok, thank you!

Using the following logic for __len__ has the extra advantage of considering all possible chunks in the timeseries during training.

    def __len__(self):
        return len(self.index_map)
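With that in place, iteration works with a plain DataLoader, e.g. (a sketch, assuming the dataset class sketched above):

from torch.utils.data import DataLoader

dataset = MyDataset(data_file)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    ...  # each batch mixes timesteps drawn from arbitrary samples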