I’m having trouble designing a dataset generator and dataloader for an LSTM network. I’m working on a rather large time series dataset, which is organized as follows:
Continuously recorded data is stored in the rows of a data frame. Additionally, there is a column identifying the experiment each data point belongs to, as well as another column identifying the repetition of the experiment. So the data set is very long in terms of rows, but it is discontinuous, since I have repetitions of variable length and multiple repetitions per experiment.
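For concreteness, a toy frame with this layout might look like the following (values and feature names are made up; in the real frame there are 8 feature columns at positions 0–7, the target sits at column index 32, and repetition IDs are unique across experiments):

```python
import pandas as pd

# toy illustration of the layout described above (made-up values)
df = pd.DataFrame({
    'feat_0':     [0.1, 0.2, 0.3, 0.4, 0.5],
    'feat_1':     [1.1, 1.2, 1.3, 1.4, 1.5],
    'experiment': [1,   1,   1,   2,   2],
    'repetition': [1,   1,   2,   3,   3],  # variable-length repetitions
})
```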
Now, for each repetition, I want to create sequences of length 1024, say. To do so, I have to iterate over each experiment and, for each experiment, over each repetition. I followed this blog post: https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd and my IterableDataset looks like this:
```python
import itertools
import torch.utils.data as Data
from torch.utils.data import IterableDataset

class DatasetGenerator(IterableDataset):
    def __init__(self, data):
        self.data = data

    def process_data(self, data):
        # iterate over unique repetitions (this also walks over the
        # experiments, since repetition IDs are unique across experiments)
        for rep in data['repetition'].unique():
            # extract the whole sequence of the current repetition
            sequence = data[data['repetition'] == rep]
            # create sequences of length SEQUENCE_SIZE from the repetition sequence
            for i in range(SEQUENCE_SIZE, sequence.shape[0]):
                feat_seq = sequence.iloc[i - SEQUENCE_SIZE : i, 0:8].values  # select feature sequence
                target = sequence.iloc[i, 32]
                target = target.reshape(-1, 1)
                feat_seq = feature_scaler.transform(feat_seq)  # transform with feature scaler
                target = target_scaler.transform(target)
                yield feat_seq, target

    def get_stream(self, data):
        return itertools.cycle(self.process_data(data))
        # return itertools.chain.from_iterable(map(self.process_data, itertools.cycle(data)))

    def __iter__(self):
        # return itertools.cycle(self.process_data(self.data))
        return self.get_stream(self.data)

train_gen = DatasetGenerator(train_volume)
loader = Data.DataLoader(train_gen, batch_size=512)
```
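A quick sanity check on the raw generator, before wrapping it in the DataLoader, looks like this (a sketch, assuming `SEQUENCE_SIZE` and the two fitted scalers are defined as above):

```python
# pull a single sample straight from the generator
gen = DatasetGenerator(train_volume)
feat_seq, target = next(iter(gen))
print(feat_seq.shape)  # expected: (SEQUENCE_SIZE, 8)
print(target.shape)    # expected: (1, 1)
```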
Now, when I iterate over the loader in the training loop, I obtain tensors like this:
```python
for features, target in itertools.islice(loader, 2):
    target = target.view(-1, 1)
    print(features.shape)
    print(target.shape)
    # process data with LSTM network...
```

```
torch.Size([512, 1024, 8])
```
This is exactly what I was aiming for: the data has the form [batch_size, sequence_length, feature_dimension]. However, this setup seems to be really slow, and the program usually crashes before I even get my first loss output after the first epoch.
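For reference, downstream of the loader the batch is meant to be consumed roughly like this (a minimal sketch; the `nn.LSTM` sizes and the linear head are placeholders, since the actual network isn’t part of the question):

```python
import torch
import torch.nn as nn

# placeholder network matching the batch layout
# [batch_size, sequence_length, feature_dimension]
lstm = nn.LSTM(input_size=8, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)  # hypothetical head for the scalar target

features = torch.randn(512, 1024, 8)  # stand-in for one batch
out, (h_n, c_n) = lstm(features)      # out: [512, 1024, 64]
prediction = head(out[:, -1, :])      # last time step -> [512, 1]
```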
Does anybody see a structural mistake in my approach? I have already researched different ways to accomplish this task, but since I have all my data loaded in the `data` data frame, I thought I could use it as an IterableDataset to create the sequences “on the fly” in my training loop for each batch.
I’d be glad for any advice and hints on this :)