I’m having trouble designing a dataset generator and dataloader for an LSTM network. I’m working on a rather large time series dataset, which is organized as follows:
Continuously recorded data is stored in the rows of a data frame. Additionally, there is a column identifying the experiment each data point belongs to, as well as another column identifying the repetition of the experiment. So the data set is very long in terms of rows, but it is discontinuous, since I have repetitions of variable length and multiple repetitions per experiment.
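For concreteness, a toy frame with this layout might look like the following (values and feature names are made up; in the real frame there are 8 feature columns at positions 0–7, the target sits at column index 32, and repetition IDs are unique across experiments):

```python
import pandas as pd

# toy illustration of the layout described above (made-up values)
df = pd.DataFrame({
    'feat_0':     [0.1, 0.2, 0.3, 0.4, 0.5],
    'feat_1':     [1.1, 1.2, 1.3, 1.4, 1.5],
    'experiment': [1,   1,   1,   2,   2],
    'repetition': [1,   1,   2,   3,   3],  # variable-length repetitions
})
```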
Now, for each repetition, I want to create sequences of length 1024, say. To do so, I have to iterate over each experiment and, for each experiment, over each repetition. I followed this blog post: https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd and my IterableDataset looks like this:
```python
import itertools
import torch.utils.data as Data
from torch.utils.data import IterableDataset

class DatasetGenerator(IterableDataset):
    def __init__(self, data):
        self.data = data

    def process_data(self, data):
        # iterate over unique repetitions (this also walks over the
        # experiments, since repetition IDs are unique across experiments)
        for rep in data['repetition'].unique():
            # extract the whole sequence of the current repetition
            sequence = data[data['repetition'] == rep]
            # create sequences of length SEQUENCE_SIZE from the repetition sequence
            for i in range(SEQUENCE_SIZE, sequence.shape[0]):
                feat_seq = sequence.iloc[i - SEQUENCE_SIZE : i, 0:8].values  # select feature sequence
                target = sequence.iloc[i, 32]
                target = target.reshape(-1, 1)
                feat_seq = feature_scaler.transform(feat_seq)  # transform with feature scaler
                target = target_scaler.transform(target)
                yield feat_seq, target

    def get_stream(self, data):
        return itertools.cycle(self.process_data(data))
        # return itertools.chain.from_iterable(map(self.process_data, itertools.cycle(data)))

    def __iter__(self):
        # return itertools.cycle(self.process_data(self.data))
        return self.get_stream(self.data)

train_gen = DatasetGenerator(train_volume)
loader = Data.DataLoader(train_gen, batch_size=512)
```
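A quick sanity check on the raw generator, before wrapping it in the DataLoader, looks like this (a sketch, assuming `SEQUENCE_SIZE` and the two fitted scalers are defined as above):

```python
# pull a single sample straight from the generator
gen = DatasetGenerator(train_volume)
feat_seq, target = next(iter(gen))
print(feat_seq.shape)  # expected: (SEQUENCE_SIZE, 8)
print(target.shape)    # expected: (1, 1)
```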
Now, when I iterate over the loader in the training loop, I obtain tensors like this:
```python
for features, target in itertools.islice(loader, 2):
    target = target.view(-1, 1)
    print(features.shape)
    print(target.shape)
    # process data with LSTM network...
```

```
torch.Size([512, 1024, 8])
```
This is exactly what I was aiming for: the data has the form [batch_size, sequence_length, feature_dimension]. However, this setup seems to be really slow, and the program usually crashes before I even get my first loss output after the first epoch.
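For reference, downstream of the loader the batch is meant to be consumed roughly like this (a minimal sketch; the `nn.LSTM` sizes and the linear head are placeholders, since the actual network isn’t part of the question):

```python
import torch
import torch.nn as nn

# placeholder network matching the batch layout
# [batch_size, sequence_length, feature_dimension]
lstm = nn.LSTM(input_size=8, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)  # hypothetical head for the scalar target

features = torch.randn(512, 1024, 8)  # stand-in for one batch
out, (h_n, c_n) = lstm(features)      # out: [512, 1024, 64]
prediction = head(out[:, -1, :])      # last time step -> [512, 1]
```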
Does anybody see a structural mistake in my approach? I have already researched different ways to accomplish this task, but since I have all my data loaded in the `data` data frame, I thought I could use it as an IterableDataset to create the sequences “on the fly” in my training loop for each batch.
I’d be glad for any advice and hints on this :)