Dataset Class for sequential data

0x97 · February 27, 2023, 10:46am

Dear all,
I’m dealing with sequential data, and I would like to find a way to use the torch Dataset class, but I cannot find any example for sequential data.

Context
My dataset is a stack of M matrices, where each matrix is a high-dimensional time series of shape (N,T) (T = time series length), resulting in a dataset tensor of shape (M,N,T).
The model operates in a one-to-many style, meaning that I feed it a tensor of shape (N), and it outputs a tensor of shape (N,t), where t<T is the sequence length.
The idea during training is to extract a batch of m<M time series, forming tensors of shape (m,N,t), with initial values of shape (m,N).

Question
Given the above context, is the Dataset class a good way to operate? I’ve already written a class myself with some methods for my specific case, but I’ve read that the Dataset class can be used to parallelize training. My main problem is loading the entire tensor into RAM, which in some cases is too expensive.

Jamie_Donnelly · February 27, 2023, 1:49pm

A Dataset class is definitely useful for parallel training: a DistributedSampler passed to the DataLoader constructor (via its sampler argument) splits the dataset indices across processes based on each process’s rank and the world size.
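
To make the sampler part concrete, here is a minimal sketch of how the indices get divided between two processes. ToyDataset and indices_seen_by are made-up names for illustration only. I pass num_replicas and rank by hand so it runs without a process group, whereas in real distributed training you would normally leave them unset and let the sampler read them from torch.distributed.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


class ToyDataset(Dataset):
    """Hypothetical stand-in dataset: item i is just the tensor [i]."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> torch.Tensor:
        return torch.tensor([index])


def indices_seen_by(rank: int, world_size: int, size: int = 8) -> list:
    """Return the dataset indices that the process with this rank would load."""
    dataset = ToyDataset(size)
    # Explicit num_replicas/rank: an assumption made so no process group is needed.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    return [int(i) for batch in loader for i in batch.flatten()]


print(indices_seen_by(rank=0, world_size=2))  # [0, 2, 4, 6]
print(indices_seen_by(rank=1, world_size=2))  # [1, 3, 5, 7]
```

Each rank only ever asks the Dataset for its own share of the indices. If you switch to shuffle=True, call sampler.set_epoch(epoch) at the start of every epoch so the shuffling differs between epochs.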

All you really need to implement for a simple Dataset class are __len__ and __getitem__ methods. Depending on how your data is stored on disk you don’t need to load all the data into memory at once.

E.g., if you save your data for each timestep on disk as a tuple like (input, target) at .../data_time_i.pt you can implement something as simple as,

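import os

import torch
from torch.utils.data import Dataset

class TestDataset(Dataset):
    def __init__(self, root):
        # One file per sample; sort so the index order is reproducible
        self.files = sorted(os.path.join(root, f) for f in os.listdir(root))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index: int):
        # Only the requested sample is read from disk
        x, y = torch.load(self.files[index])
        return x, y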

Obviously how your Dataset class looks will depend a lot on how your data is currently stored on disk, but the code above is easily adapted, for example if the data is saved as NumPy arrays or CSV files.
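
For the (M,N,T) layout described in the question, a rough sketch could look like the following. It rests on several assumptions that are not in the original post: the whole tensor is saved as a single .npy file, the target is the first t steps of each series, and the initial value is the first column of that window. SeriesWindowDataset and npy_path are made-up names. The file is memory-mapped, so only the requested slice is read into RAM.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SeriesWindowDataset(Dataset):
    """One item per series: (initial value of shape (N,), window of shape (N, t))."""

    def __init__(self, npy_path: str, t: int):
        self.npy_path = npy_path
        self.t = t
        # Read only the shape; mmap_mode="r" does not pull the array into RAM.
        self.shape = np.load(npy_path, mmap_mode="r").shape  # (M, N, T)
        if not 0 < t <= self.shape[2]:
            raise ValueError("t must satisfy 0 < t <= T")
        # Opened lazily so each DataLoader worker creates its own memory map.
        self._data = None

    def __len__(self) -> int:
        return self.shape[0]

    def __getitem__(self, index: int):
        if self._data is None:
            self._data = np.load(self.npy_path, mmap_mode="r")
        # np.array(...) copies just this (N, t) slice from disk into memory.
        window = torch.from_numpy(np.array(self._data[index, :, : self.t]))
        x0 = window[:, 0].clone()  # assumed initial value: first time step
        return x0, window
```

With DataLoader(dataset, batch_size=m), the default collation stacks these items into tensors of shape (m,N) and (m,N,t), and setting num_workers above zero lets several processes read slices in parallel.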