Splitting up sequential batches into randomly shuffled train-test subsets

Hi,
I have a sequential data set with shape (n_time_points, n_features). Obviously, I want my batches to contain consecutive timepoint sequences, but I also want to randomly split them into training, validation and test batches. Going through the tutorials and forums, I came across some of the relevant classes (Dataset, Subset, DataLoader, SequentialSampler, BatchSampler), but I can’t quite figure out how to combine them to achieve what I want. The methods I’ve found seem to either shuffle the timepoints within a sequence or create subsets with consecutive batches. The seq-to-seq tutorial actually implements the data loading functions from scratch, which doesn’t seem like a great idea in my case, especially if at some point I want to parallelize across multiple GPUs, etc. For simplicity, I’m first trying a batch_size of 1, i.e. I just want to sample one n_bptt-length sequence at a time, but at some point I might want to use a larger one.
(As you might’ve guessed, I’m pretty new to PyTorch. Any help would be highly appreciated.)

How would you like to create the time sequence from the data and how is the data currently stored?
E.g. if your original data is in the shape [n_time_points, features] and you would like to create overlapping windows, you could slice it in the __getitem__ of your custom Dataset.
Here is a simple code example:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, window_size):
        self.data = data
        self.window_size = window_size

    def __getitem__(self, index):
        # return a consecutive window of window_size timepoints starting at index
        x = self.data[index:index+self.window_size]
        return x

    def __len__(self):
        # number of valid (overlapping) window start positions
        return len(self.data) - self.window_size + 1

data = torch.arange(100).view(100, 1).float()
dataset = MyDataset(data, window_size=10)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch in loader:
    print(batch)

You can also set shuffle=True to shuffle the passed indices.
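
If you also want to randomly split these windows into train/validation/test subsets while each sample itself stays a consecutive window, one option would be torch.utils.data.random_split on top of the same Dataset. A minimal sketch (the 80/10/10 split ratios are just an example):

from torch.utils.data import random_split

n_total = len(dataset)
n_train = int(0.8 * n_total)
n_val = int(0.1 * n_total)
n_test = n_total - n_train - n_val

# randomly assign whole windows to the three subsets
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])

train_loader = DataLoader(train_set, batch_size=2, shuffle=True)
val_loader = DataLoader(val_set, batch_size=2, shuffle=False)
test_loader = DataLoader(test_set, batch_size=2, shuffle=False)

Each DataLoader then yields batches of consecutive windows, while the assignment of windows to train/val/test is random.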

Thanks a lot for the super fast reply! IMHO, your answers on this forum are the single best learning resource for PyTorch beginners out there… :pray:

I was hoping for a solution that loads consecutive sequences within each batch, while randomly splitting the batches into train-test subsets, but I guess this may over-complicate things and is not strictly necessary. Instead, I’ve tried to adapt your suggestion to the common case of non-overlapping sequences (e.g. as in the PyTorch tutorial on seq-to-seq with Transformers mentioned above):

class NoOverlapDataset(Dataset):
    def __init__(self, data, window_size):
        self.data = data
        self.window_size = window_size

    def __getitem__(self, index):
        # map index to the start of the index-th non-overlapping window
        seq_i = index * self.window_size
        x = self.data[seq_i : seq_i + self.window_size]
        return x

    def __len__(self):
        # number of full, non-overlapping windows
        return len(self.data) // self.window_size
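
As a quick sanity check, I tried it on toy data like this (hypothetical snippet, reusing the imports from your example):

data = torch.arange(100).view(100, 1).float()
dataset = NoOverlapDataset(data, window_size=10)   # 10 non-overlapping windows
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in loader:
    print(batch.shape)  # torch.Size([2, 10, 1]); windows don't overlap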

Does that make sense / look correct?