How to separate one trajectory from another in the training set

Hi there! I’ve created a dataset in which 5 trajectories taken in the same scenario are concatenated in this way:

  • Each trajectory has N measured positions (this is the number of rows) and a fixed number of columns (shared between trajectories) that represent the data at each position of the trajectory. So each trajectory has shape (N, n_columns).
    The full dataset therefore has shape (N * number_trajectories, n_columns). My problem is that I’m trying to train an LSTM with different sequence lengths, where the sequence length is how many positions the LSTM sees when predicting the output.
    However, I have a problem separating the trajectories for training: some of the final positions of one trajectory get fed together with the first positions of the next trajectory, because of how I feed the network with the sequence length.
    An analogy in case I’m not explaining it well: it’s as if my dataset were composed of 7 reviews expressed as one large string, and I mixed the last words of my first review with the first words of my second review, which have nothing to do with each other. In that case I know I can separate them by splitting the reviews on newlines, creating a vector in which each row is one review.

However, I don’t know how to do this with my vector data. Any help?
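The analogue of "splitting reviews on newlines" for row-concatenated arrays could be sketched like this, assuming the per-trajectory lengths are known (the lengths and `n_columns` below are made-up values for illustration):

```python
import numpy as np

# Hypothetical example: 5 trajectories concatenated row-wise into one array.
n_columns = 4
lengths = [100, 80, 120, 90, 110]            # rows per trajectory (assumed known)
data = np.random.randn(sum(lengths), n_columns)

# Split points are the cumulative sums of all but the last length
boundaries = np.cumsum(lengths)[:-1]         # [100, 180, 300, 390]
trajectories = np.split(data, boundaries, axis=0)

# Each element is now one trajectory of shape (N_i, n_columns)
print([t.shape for t in trajectories])
```

From here, each element of `trajectories` can be windowed independently, so no sequence ever crosses a trajectory boundary.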

You’ll have to either write code for hierarchical sampling (pick a timeseries, then a random slice) or cut some tails (with nn.utils.rnn.pack_padded_sequence, perhaps). It sounds like you’re using a simple 2D data loader; it’s probably best to fix things at that level rather than attempting to implement mid-sequence reset schemes.
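One way to do the hierarchical sampling at the dataset level is to enumerate every valid (trajectory, start offset) pair up front, so a window can never span two trajectories. A minimal sketch, assuming `trajectories` is a list of (N_i, n_columns) arrays (the class name is mine, not from any library):

```python
import torch
from torch.utils.data import Dataset

class WindowedTrajectoryDataset(Dataset):
    """Fixed-length sliding windows drawn per trajectory, so no window
    ever crosses the boundary between two concatenated trajectories."""

    def __init__(self, trajectories, seq_len):
        self.seq_len = seq_len
        self.trajectories = [torch.as_tensor(t, dtype=torch.float32)
                             for t in trajectories]
        # Precompute (trajectory index, start offset) for every valid window
        self.index = [(i, s)
                      for i, t in enumerate(self.trajectories)
                      for s in range(len(t) - seq_len + 1)]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        i, s = self.index[idx]
        return self.trajectories[i][s:s + self.seq_len]
```

Wrapping this in a plain `DataLoader(..., shuffle=True)` then gives randomly ordered windows that each stay inside a single trajectory.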

Hi there! Thanks, I get what you’re telling me. That’s why I’m trying to change how I get an item in the custom dataset. However, my sequences are of different sizes, so I don’t really know how I can get a different sequence for each idx when I call the DataLoader. I know this won’t work for batch_size != 1, but I want to try first whether it works for batch_size=1. Any help?

You can use torch’s DataLoader to iterate over the timeseries and, in addition, manually sample random time offsets. This can be done inside collate_fn (see collate_batch in cell 20 here as an example of collate_fn, though it is a bit different - full sequences are padded), or inside __iter__ (IterableDataset), which is the more conventional way.
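The collate_fn variant could be sketched like this: the dataset yields whole trajectories, and the random slicing happens when the batch is assembled (names here are my own, and the fixed `seq_len` is an assumption for illustration):

```python
import random
import torch
from torch.utils.data import Dataset

class TrajectoryDataset(Dataset):
    """Each item is one full trajectory tensor (assumed list of
    (N_i, n_columns) tensors, possibly of different lengths)."""
    def __init__(self, trajectories):
        self.trajectories = trajectories

    def __len__(self):
        return len(self.trajectories)

    def __getitem__(self, idx):
        return self.trajectories[idx]

def collate_random_slice(batch, seq_len=10):
    """Sample a random start offset inside each trajectory, so every
    slice stays within a single trajectory."""
    slices = []
    for traj in batch:
        start = random.randint(0, len(traj) - seq_len)
        slices.append(traj[start:start + seq_len])
    return torch.stack(slices)          # (batch, seq_len, n_columns)

# Usage: DataLoader(TrajectoryDataset(trajs), batch_size=4,
#                   collate_fn=collate_random_slice)
```

Because the slicing is done per trajectory inside collate_fn, trajectories of different lengths can be batched together as long as each one is at least `seq_len` long.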

Thanks, I’ve seen that I can create a custom sampler for my custom dataset, so that’s what I’m trying. I’ll also take a look at your proposal.