How to separate one trajectory from another in the training set

Hi there! I’ve created a dataset in which 5 trajectories taken in the same scenario are concatenated in this way:

  • Each trajectory has N measured positions (this is the number of rows) and a fixed number of columns (shared between trajectories) that represent the data at each position of the trajectory. So each trajectory has shape (N, n_columns).
    The full dataset therefore has shape (N * number_trajectories, n_columns). My problem is that I’m trying to train an LSTM with different sequence lengths, where the sequence length is how many positions the LSTM sees when predicting the output.
    However, I have a problem separating the trajectories for training: some of the final positions of one trajectory get fed together with the first positions of the next trajectory, because of how I feed the network with the sequence length.
    An analogy in case I’m not explaining it well: it’s as if my dataset were composed of 7 reviews expressed as one large string, and I mixed the last words of my first review with the first words of my second review, which have nothing to do with each other. In that case I know I can separate them by splitting the reviews on newlines, creating a vector in which each row is one review.

However, I don’t know how to do this with my vector data. Any help?
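The analogue of "splitting reviews on newlines" for row-concatenated arrays could be sketched like this, assuming the per-trajectory lengths are known (the lengths and `n_columns` below are made-up values for illustration):

```python
import numpy as np

# Hypothetical example: 5 trajectories concatenated row-wise into one array.
n_columns = 4
lengths = [100, 80, 120, 90, 110]            # rows per trajectory (assumed known)
data = np.random.randn(sum(lengths), n_columns)

# Split points are the cumulative sums of all but the last length
boundaries = np.cumsum(lengths)[:-1]         # [100, 180, 300, 390]
trajectories = np.split(data, boundaries, axis=0)

# Each element is now one trajectory of shape (N_i, n_columns)
print([t.shape for t in trajectories])
```

From here, each element of `trajectories` can be windowed independently, so no sequence ever crosses a trajectory boundary.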

You’ll have to either write code for hierarchical sampling (pick a timeseries, then a random slice) or cut some tails (with nn.utils.rnn.pack_padded_sequence, perhaps). It sounds like you’re using a simple 2D data loader; it’s probably best to fix things at that level rather than attempting to implement mid-sequence reset schemes.
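One way to do the hierarchical sampling at the dataset level is to enumerate every valid (trajectory, start offset) pair up front, so a window can never span two trajectories. A minimal sketch, assuming `trajectories` is a list of (N_i, n_columns) arrays (the class name is mine, not from any library):

```python
import torch
from torch.utils.data import Dataset

class WindowedTrajectoryDataset(Dataset):
    """Fixed-length sliding windows drawn per trajectory, so no window
    ever crosses the boundary between two concatenated trajectories."""

    def __init__(self, trajectories, seq_len):
        self.seq_len = seq_len
        self.trajectories = [torch.as_tensor(t, dtype=torch.float32)
                             for t in trajectories]
        # Precompute (trajectory index, start offset) for every valid window
        self.index = [(i, s)
                      for i, t in enumerate(self.trajectories)
                      for s in range(len(t) - seq_len + 1)]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        i, s = self.index[idx]
        return self.trajectories[i][s:s + self.seq_len]
```

Wrapping this in a plain `DataLoader(..., shuffle=True)` then gives randomly ordered windows that each stay inside a single trajectory.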

Hi there! Thanks, I get what you’re telling me. That’s why I’m trying to change how I get an item in the custom dataset. However, my sequences are of different sizes, so I don’t really know how I can get a different sequence for each idx when I call the DataLoader. I know this won’t work for batch_size != 1, but I want to try first whether it works for batch_size=1. Any help?

You can use torch’s DataLoader to iterate over the timeseries and, in addition, manually sample random time offsets. This can be done inside collate_fn (see collate_batch in cell 20 here as an example of collate_fn, though it is a bit different - full sequences are padded), or inside __iter__ (IterableDataset), which is the more conventional way.
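The collate_fn variant could be sketched like this: the dataset yields whole trajectories, and the random slicing happens when the batch is assembled (names here are my own, and the fixed `seq_len` is an assumption for illustration):

```python
import random
import torch
from torch.utils.data import Dataset

class TrajectoryDataset(Dataset):
    """Each item is one full trajectory tensor (assumed list of
    (N_i, n_columns) tensors, possibly of different lengths)."""
    def __init__(self, trajectories):
        self.trajectories = trajectories

    def __len__(self):
        return len(self.trajectories)

    def __getitem__(self, idx):
        return self.trajectories[idx]

def collate_random_slice(batch, seq_len=10):
    """Sample a random start offset inside each trajectory, so every
    slice stays within a single trajectory."""
    slices = []
    for traj in batch:
        start = random.randint(0, len(traj) - seq_len)
        slices.append(traj[start:start + seq_len])
    return torch.stack(slices)          # (batch, seq_len, n_columns)

# Usage: DataLoader(TrajectoryDataset(trajs), batch_size=4,
#                   collate_fn=collate_random_slice)
```

Because the slicing is done per trajectory inside collate_fn, trajectories of different lengths can be batched together as long as each one is at least `seq_len` long.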

Thanks, I’ve seen that I can create a custom sampler for my custom dataset, so that’s what I’m trying. I’ll also take a look at your proposal.