How to load sequential data from separate datasets?

I want to create a dataset/dataloader setup that simply yields one batch from one of multiple datasets at a time, sequentially.

I’m new to ML, but this seems like a fairly common task for multi-time-series prediction. I have multiple time series of exactly the same structure and I want to train one model on all of them. For that it is important that the model always receives a sequence of data from one time series at a time, without overlapping them. All approaches I found would ultimately just concatenate the series, which is not what I want.

If I have, for example, three datasets of length 10, 20 and 30, and a batch size of 15, then I’d expect my batches to have sizes 10, 15, 5, 15, 15.
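
To spell out the arithmetic I have in mind (just an illustration of the expected behavior, not an attempt at a solution):

```python
# Batch sizes when each dataset is batched on its own,
# so that no batch ever spans two datasets:
dataset_lengths = [10, 20, 30]
batch_size = 15
for n in dataset_lengths:
    print([min(batch_size, n - start) for start in range(0, n, batch_size)])
# [10]
# [15, 5]
# [15, 15]
```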

Any news on this? I’m having trouble finding information on the same problem :confused: I’m also trying to create a Dataset from several time series arrays of varying length.

I can’t post my code, but I ended up writing my own dataset class inheriting from the PyTorch Dataset, which was easier than I thought. It takes all the original datasets and generates samples through slicing whenever the dataloader requests them.

In hindsight, my original question wasn’t really accurate, as it doesn’t use the term batch size correctly. Batches are generated automatically by the dataloader according to its settings. The custom dataset just needs to yield individual sequences of an appropriate, standardized shape (e.g. sequence length 15 if one wants 15 time steps per sequence for time series prediction) across all sets.
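
The dataloader side then stays completely standard. Roughly like this (the shapes and the stand-in dataset are just example values, not my real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the custom dataset: 100 sequences of 15 time steps,
# 4 features each. The real dataset just has to yield the same
# fixed shape per sample so the default collate can stack them.
dataset = TensorDataset(torch.randn(100, 15, 4))

loader = DataLoader(dataset, batch_size=32, shuffle=True)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([32, 15, 4]) for full batches
    break
```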

Thank you for your reply! I have been trying to do that too, but from what you said, the solution probably lies in the slicing in my custom Dataset. Could you maybe post or send me just your __getitem__ function, if that’s possible for you? I am new to this forum and don’t know how to exchange contact details :smile:

My implementation does most of the important work in the __init__ function: it iterates over all datasets and remembers where each sequence is supposed to start and end. The __getitem__ function then only performs the slice and returns it.
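
I still can’t share the actual code, but a minimal sketch of the idea looks roughly like this (the names, the stride of 1 and the tensor shapes are simplifications, not my real implementation):

```python
from torch.utils.data import Dataset

class MultiSeriesDataset(Dataset):
    """Yields fixed-length windows, each cut from exactly one series."""

    def __init__(self, series_list, seq_len):
        # series_list: list of arrays/tensors, each of shape (T_i, num_features)
        self.series_list = series_list
        # Precompute (series_index, start, end) for every valid window,
        # so that no window ever crosses a series boundary.
        self.index = []
        for s_idx, series in enumerate(series_list):
            for start in range(len(series) - seq_len + 1):
                self.index.append((s_idx, start, start + seq_len))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        # The only per-request work is the slice itself.
        s_idx, start, end = self.index[i]
        return self.series_list[s_idx][start:end]
```

A standard DataLoader on top of this then batches the windows however you configure it, and every sample is still guaranteed to come from a single series.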

I got confused: if you want a custom dataset/dataloader setup that yields one batch from one of multiple datasets, then yes, you have to write a custom dataset for that. But if you want your batches to be dynamic in size, that is not possible as far as I know.
Also, what you are describing is the sequence length, not the batch size, because a batch doesn’t work like that. A batch size of 10 means one batch contains 10 data sequences, the model is trained on one batch at a time, and the batches don’t need to be in sequence. Think of it like classical machine learning, where the rows don’t need to be in order and we shuffle them during training: the batches are now your rows, you are training on a new dataset made up of these batches, and you randomly pick any batch and train your model on it.
As for your sequence length, I would advise that if your smallest dataset’s size is 10, then make that your sequence length.
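
As a rough illustration of the difference (all the numbers here are made up):

```python
import torch

seq_len = 10      # time steps the model sees per sample
batch_size = 32   # samples processed in parallel per training step
num_features = 4

# One batch for a batch_first RNN/LSTM-style model:
batch = torch.randn(batch_size, seq_len, num_features)
# Each of the 32 sequences must be internally ordered in time,
# but the 32 sequences themselves can be drawn in any order.
print(batch.shape)  # torch.Size([32, 10, 4])
```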

Good luck.
Ayush