My data is stored in a three-dimensional tensor (number of samples, length of time series, feature dimension).
Concatenating the samples into a single time series is not possible in my case for methodological reasons. Hence, I need a custom __getitem__ method that accepts two indices: one to choose the sample and one to choose the position within that sample. What exactly do I have to change to make that work? Do I have to write a custom collate_fn as well? Which adjustments would be necessary there?
I have just realized that this would require editing the _MapDatasetFetcher(_BaseDatasetFetcher) class, since it contains this call:
data = [self.dataset[idx] for idx in possibly_batched_index]
That is, the dataset's __getitem__ method is apparently not meant to take two indices. So do I even have a chance of implementing my approach with two indices in __getitem__?
I don’t know of a way to give the dataset two indices, but I can think of another solution to your problem.
When you initialize your dataset, you could build a mapping from a one-dimensional index to your two-dimensional index. I wrote some code that will hopefully help you:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_file):
        self.data_file = data_file
        self.index_map = {}
        index = 0
        for sample in data_file:  # First dimension
            sample_index = sample['index']
            for timeseries in sample:  # Second dimension
                timeseries_index = timeseries['index']
                self.index_map[index] = (sample_index, timeseries_index)
                index += 1

    def __getitem__(self, idx):
        sample_index, timeseries_index = self.index_map[idx]
        # Use the two indices to get the desired data
        ...
Hi,
The point is that you will have a fixed number of samples. You can think of a sample as one NN input. So if you need two indices because your data has shape (N_samples, length), you can write the dataset as if it had N_samples * length samples and create a mapping from the flat index in [0, N_samples * length) to the pair (sample, position).
The Dataset class is flexible: you just return how many elements you have and it will iterate over that amount. If your data is more complex, you just need to think about how to code it. I would say that it doesn’t accept several indices because it follows the maxim “1 input - 1 index”.
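For series of equal length, the mapping described above can be sketched with integer division. This is only an illustration and not code from the thread; the sizes and the helper name flat_to_pair are hypothetical.

```python
# Hypothetical sizes, chosen only for illustration.
n_samples = 3   # number of samples
length = 4      # time steps per sample (assumed equal for all samples)


def flat_to_pair(idx, length):
    """Map a flat index in [0, n_samples * length) to (sample, step)."""
    return divmod(idx, length)


# The dataset would report n_samples * length items from __len__.
total = n_samples * length
pairs = [flat_to_pair(i, length) for i in range(total)]
print(pairs[:5])  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]
```

Inside __getitem__ the pair would then be used to index the tensor, for example data[sample, step]. This only works as long as every sample has the same length.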
Hi, thank you both for your answers. @Juan, what I did not mention in my question is that my time series have different lengths, so your advice wouldn’t work unless I pad them.
@Oli, your suggestion is very interesting! However, when testing your code I get an error:
IndexError: too many indices for tensor of dimension 2
Which Python version do you use? (I use Python 3.6.)
EDIT: I have adjusted the code like this, so that it also works on my system:
index_map = {}
index = 0
for sample_index, sample in enumerate(data_file):  # First dimension
    for timeseries_index, timeseries in enumerate(sample):  # Second dimension
        index_map[index] = (sample_index, timeseries_index)
        index += 1
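For completeness, here is one way the pieces could fit together for series of different lengths. It is a sketch under assumptions, not the code used in this thread: the class name FlatIndexDataset and the toy data are hypothetical, and instead of a dictionary with one entry per time step it stores one cumulative offset per sample and looks the sample up with bisect. It is written in plain Python so it runs without torch; for use with a DataLoader the class would normally inherit from torch.utils.data.Dataset.

```python
from bisect import bisect_right
from itertools import accumulate


class FlatIndexDataset:
    """Map-style dataset over series of different lengths.

    Assumption: `series` is a list of sequences, one per sample.
    """

    def __init__(self, series):
        self.series = series
        # offsets[k] is the flat index of the first step of sample k;
        # the final entry is the total number of items.
        self.offsets = [0] + list(accumulate(len(s) for s in series))

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the sample whose range of flat indices contains idx.
        sample_index = bisect_right(self.offsets, idx) - 1
        step_index = idx - self.offsets[sample_index]
        return self.series[sample_index][step_index]


# Toy data: three series of lengths 2, 3 and 1.
dataset = FlatIndexDataset([[10, 11], [20, 21, 22], [30]])
print(len(dataset))                               # 6
print([dataset[i] for i in range(len(dataset))])  # [10, 11, 20, 21, 22, 30]
```

The __len__ method matters here because the loader needs to know how many flat indices exist. Whether __getitem__ should return a single time step, as above, or a window around it depends on the model and is left open.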