Creating batches of sequences

Hello there. I’m currently working on an LSTM Autoencoder. I have a large number of samples, and each sample contains 120 features. For now I’m creating sequences of length 1 with a batch_size of 1, and everything works fine. I first convert my data array to a list and then, using the following function, convert it to sequences of length 1:

import torch

def dataset(mydatalist):
    # turn each sample into a (seq_len, 1) tensor, i.e. one feature per time step
    dataset = [torch.tensor(s).unsqueeze(1) for s in mydatalist]

    # stack once just to read off the shapes: (n_seq, seq_len, n_features)
    n_seq, seq_len, n_features = torch.stack(dataset).shape  # n_seq, 4, 1
    return dataset, seq_len, n_features
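
For example, called on toy data (the values below are made up just to show the shapes), it behaves like this:

# Toy example: three samples of four features each become (4, 1) tensors,
# which stack to shape (3, 4, 1).
mydatalist = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]]
train_dataset, seq_len, n_features = dataset(mydatalist)
print(seq_len, n_features)  # 4 1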

Then for training I iterate with “for seq_true in train_dataset:”, which amounts to a batch_size of 1. But since I have a large number of samples, training is too slow, so I want to increase the batch_size to get better performance. Could anyone please help me with that? I know it may be a simple question, but everything I try leads to shape-related errors in the LSTM network.
It would also be very nice if you could point out how to create sequences with length greater than 1 alongside increasing the batch_size. A minimal self-contained sketch of the kind of per-sample loop I mean is below; the single-layer LSTM, loss, and optimizer there are just placeholders, not my real model.
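
import torch
import torch.nn as nn

# Toy stand-ins, only to illustrate the batch_size = 1 setup;
# my real model, loss, and optimizer differ.
seq_len, n_features = 4, 1
model = nn.LSTM(input_size=n_features, hidden_size=n_features, batch_first=True)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_dataset = [torch.randn(seq_len, n_features) for _ in range(8)]

for seq_true in train_dataset:          # one sample at a time -> batch_size = 1
    seq_true = seq_true.unsqueeze(0)    # (1, seq_len, n_features)
    seq_pred, _ = model(seq_true)       # "reconstruction" of the input
    loss = criterion(seq_pred, seq_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()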
Many thanks in advance.

I’m not 100% sure what your problem is. Typically, in the case of RNNs, going from batch_size = 1 to batch_size > 1 requires handling sequences of different lengths. However, you say

I’m creating sequences of length 1

which is a bit odd in itself, as that’s not really a sequence :). In your code snippet it looks like seq_len = 4. So this is all a bit confusing.

Just in case the problem might be sequences of different lengths: my approach is to create batches of sequences with the same length. This is particularly easy for autoencoders, since input and target sequences are the same. We talked about this extensively in this post. If it helps, my code for that is available on GitHub.
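
If all your sequences already have the same length, a DataLoader can batch them directly; here is a rough sketch (shapes, sizes, and names are just illustrative, not from your code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# All sequences share the same length, so they can be stacked into one tensor.
n_samples, seq_len, n_features = 1000, 4, 1
data = torch.randn(n_samples, seq_len, n_features)

# For an autoencoder, input and target are the same tensor, so one dataset is enough.
loader = DataLoader(TensorDataset(data), batch_size=32, shuffle=True)

lstm = torch.nn.LSTM(input_size=n_features, hidden_size=16, batch_first=True)

for (batch,) in loader:                 # batch: (batch_size, seq_len, n_features)
    output, (h_n, c_n) = lstm(batch)    # output: (batch_size, seq_len, 16)
    # ... encode/decode and compute the reconstruction loss against `batch`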
