Sequential Dataloader for LSTM Analysis

Hi Everybody,

I’m having trouble designing a dataset generator and dataloader for an LSTM network. I’m working on a rather big time series dataset, which is organized as follows:

There is continuously recorded data stored in the rows of the data frame. Additionally, there is a column indicating which experiment each data point belongs to, as well as another column containing the repetition of the experiment. So the data set is very long in terms of rows, but it is discontinuous, since I have repetitions of variable length and multiple repetitions per experiment.
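To make the layout concrete, here is a tiny toy version of the frame (the column names other than 'experiment' and 'repetition' are just placeholders; the real frame has 8 feature columns and the target in column 32):

import numpy as np
import pandas as pd

# toy stand-in for the real data frame: continuously recorded samples,
# each tagged with the experiment and the repetition they belong to
toy = pd.DataFrame({
    'feat_0':     np.random.randn(7),
    'feat_1':     np.random.randn(7),
    'target':     np.random.randn(7),
    'experiment': [1, 1, 1, 1, 2, 2, 2],
    'repetition': [1, 1, 2, 2, 3, 3, 3], # repetitions have variable length
})
print(toy)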

Now, for each repetition, I want to create sequences of, let’s say, length 1024. To do so, I have to iterate over each experiment and, for each experiment, over each repetition. I followed this blog post: https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd and my IterableDataset looks like this:

import itertools

import torch.utils.data as Data
from torch.utils.data import IterableDataset

SEQUENCE_SIZE = 1024 # length of the sequences described above

class DatasetGenerator(IterableDataset):
    
    def __init__(self, data):
        self.data = data
        
    def process_data(self, data):
        # iteration over unique repetitions (automatically iterates over the experiments this way)
        for rep in data['repetition'].unique():
            sequence = data[data['repetition']==rep] #extract the whole sequence of the current repetition
            # create sequences of length 'SEQUENCE_SIZE' from the repetition-sequence
            for i in range(SEQUENCE_SIZE, sequence.shape[0]):
                feat_seq = sequence.iloc[i - SEQUENCE_SIZE : i, 0:8].values # select feature sequence
                target = sequence.iloc[i, 32] # select target value
                target = target.reshape(-1, 1) # reshape to (1, 1) so the target scaler accepts it
        
                feat_seq = feature_scaler.transform(feat_seq) # transform with feature scaler
                target = target_scaler.transform(target) # transform with target scaler
            
                yield feat_seq, target
                    
    def get_stream(self, data):
        # itertools.cycle caches the yielded items and repeats the stream indefinitely
        return itertools.cycle(self.process_data(data))
        #return itertools.chain.from_iterable(map(self.process_data, itertools.cycle(data)))
        
    def __iter__(self):
        #return itertools.cycle(self.process_data(self.data))
        return self.get_stream(self.data)
    
train_gen = DatasetGenerator(train_volume) 
loader = Data.DataLoader(train_gen, batch_size=512)
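(For context: feature_scaler and target_scaler are not shown above; they are fitted once on the training data beforehand, roughly like this, with StandardScaler only as a stand-in for the scaler I actually use.)

from sklearn.preprocessing import StandardScaler

# columns 0:8 hold the features, column 32 holds the target
feature_scaler = StandardScaler().fit(train_volume.iloc[:, 0:8].values)
target_scaler = StandardScaler().fit(train_volume.iloc[:, [32]].values)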

Now, when I iterate over the loader in the training loop, I obtain tensors like this:

for features, target in itertools.islice(loader, 2):
    target = target.view(-1,1)
    print(features.shape)
    print(target.shape)

    # process data with LSTM network...

Print output:
torch.Size([512, 1024, 8])
torch.Size([512, 1])

This is exactly what I was aiming for. The form of the data is [batch_size, sequence_length, feature_dimension]. However, this is really slow, and the program usually crashes before I get the first loss output after the first epoch.
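For reference, this shape is meant to go straight into the LSTM with batch_first=True; a minimal sketch of what I mean (hidden size and output head are placeholders, not my actual network):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)

features = torch.randn(512, 1024, 8) # [batch_size, sequence_length, feature_dimension]
out, (h_n, c_n) = lstm(features) # out: [512, 1024, 64]
prediction = head(out[:, -1, :]) # take the last time step -> [512, 1]
print(prediction.shape) # torch.Size([512, 1])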

Does anybody see a structural mistake in my approach? I have already researched different approaches to accomplish this task, but since I have all my data loaded in the ‘data’ data frame, I thought I could use it as an IterableDataset to create the sequences “on the fly” in my training loop for each batch.

I’m glad for any advice and hints on this :)