Splitting up sequential batches into randomly shuffled train-test subsets

Hi,
I have a sequential data set with shape (n_time_points, n_features). Obviously, I want my batches to contain consecutive timepoint sequences, but I also want to randomly split them into training, validation and test batches. Going through the tutorials and forums, I came across some of the relevant classes (Dataset, Subset, DataLoader, SequentialSampler, BatchSampler), but I can’t quite figure out how to combine them to achieve what I want. The methods I’ve found seem to either shuffle the timepoints within a sequence or create subsets with consecutive batches. The seq-to-seq tutorial actually implements the data loading functions from scratch, which doesn’t seem like a great idea in my case, especially if at some point I want to parallelize across multiple GPUs, etc. For simplicity, I’m first trying a batch_size of 1, i.e. I just want to sample one n_bptt-length sequence at a time, but at some point I might want to use a larger one.
(As you might’ve guessed, I’m pretty new to PyTorch. Any help would be highly appreciated.)

How would you like to create the time sequence from the data and how is the data currently stored?
E.g. if your original data is in the shape [n_time_points, features] and you would like to create overlapping windows, you could slice it in the __getitem__ of your custom Dataset.
Here is a simple code example:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, window_size):
        self.data = data
        self.window_size = window_size

    def __getitem__(self, index):
        # return a consecutive window of window_size timepoints starting at index
        x = self.data[index:index+self.window_size]
        return x

    def __len__(self):
        # number of valid (overlapping) window start positions
        return len(self.data) - self.window_size + 1

data = torch.arange(100).view(100, 1).float()
dataset = MyDataset(data, window_size=10)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch in loader:
    print(batch)

You can also set shuffle=True to shuffle the passed indices.
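
If you also want to randomly split these windows into train/validation/test subsets while each sample itself stays a consecutive window, one option would be torch.utils.data.random_split on top of the same Dataset. A minimal sketch (the 80/10/10 split ratios are just an example):

from torch.utils.data import random_split

n_total = len(dataset)
n_train = int(0.8 * n_total)
n_val = int(0.1 * n_total)
n_test = n_total - n_train - n_val

# randomly assign whole windows to the three subsets
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])

train_loader = DataLoader(train_set, batch_size=2, shuffle=True)
val_loader = DataLoader(val_set, batch_size=2, shuffle=False)
test_loader = DataLoader(test_set, batch_size=2, shuffle=False)

Each DataLoader then yields batches of consecutive windows, while the assignment of windows to train/val/test is random.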

Thanks a lot for the super fast reply! IMHO, your answers on this forum are the single best learning resource for PyTorch beginners out there… :pray:

I was hoping for a solution that loads consecutive sequences within each batch, while randomly splitting the batches into train-test subsets, but I guess this may over-complicate things and is not strictly necessary. Instead, I’ve tried to adapt your suggestion to the common case of non-overlapping sequences (e.g. as in the PyTorch tutorial on seq-to-seq with Transformers mentioned above):

class NoOverlapDataset(Dataset):
    def __init__(self, data, window_size):
        self.data = data
        self.window_size = window_size

    def __getitem__(self, index):
        # map index to the start of the index-th non-overlapping window
        seq_i = index * self.window_size
        x = self.data[seq_i : seq_i + self.window_size]
        return x

    def __len__(self):
        # number of full, non-overlapping windows
        return len(self.data) // self.window_size
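
As a quick sanity check, I tried it on toy data like this (hypothetical snippet, reusing the imports from your example):

data = torch.arange(100).view(100, 1).float()
dataset = NoOverlapDataset(data, window_size=10)   # 10 non-overlapping windows
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in loader:
    print(batch.shape)  # torch.Size([2, 10, 1]); windows don't overlap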

Does that make sense / look correct?