Multiple Sequential Datasets

Hello everybody. I want to train a network and my data is of the following nature:
I have multiple measurement results stored in np.arrays. These measurements are independent of each other and should not be mixed up, but they should all be used for training. The data is sequential, meaning the temporal dependencies have to be maintained (no shuffling).

How I want to train the network:
Concatenate the datasets (I am using ConcatDataset for this, but I am not sure if the data gets mixed here somehow).
Create a DataLoader that contains all these datasets and takes batches from each of them without mixing the data. For each batch training step I want the order of the datasets to be random, meaning in the first training step dataset3 may be first and dataset1 may be last, with a different order in the next step.

I have this code to create my DataLoader from a list of datasets:

    train_loader = DataLoader(ConcatDataset(train_datasets), batch_size=batch_size, shuffle=False)

I would be happy to receive some recommendations on how this could be solved!
Thanks in advance

This might not be trivial since the DataLoader will just index the ConcatDataset in order as seen here:

    import torch
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

    # three independent sequential datasets with 10 samples each
    dataset1 = TensorDataset(torch.arange(10).float())
    dataset2 = TensorDataset(torch.arange(1, 11).float() * 10)
    dataset3 = TensorDataset(torch.arange(1, 11).float() * 100)

    dataset = ConcatDataset((dataset1, dataset2, dataset3))
    loader = DataLoader(dataset, shuffle=False, batch_size=8)

    for data in loader:
        print(data)

    # [tensor([0., 1., 2., 3., 4., 5., 6., 7.])]
    # [tensor([ 8.,  9., 10., 20., 30., 40., 50., 60.])]
    # [tensor([ 70.,  80.,  90., 100., 100., 200., 300., 400.])]
    # [tensor([ 500.,  600.,  700.,  800.,  900., 1000.])]

If the batch size doesn't evenly divide the length of each subset, you will end up with batches containing mixed samples, as seen in the second and third batch above.

It might be easier to just create separate datasets and DataLoaders, select them randomly, and iterate each one sequentially, which would also avoid mixing samples from different subsets.
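A minimal sketch of that idea, reusing the toy datasets from above (the batch size, dataset sizes, and the commented-out train_step are placeholder assumptions, not your actual setup): one DataLoader per measurement with shuffle=False, and for each batch a loader is drawn at random from those that are not yet exhausted, so every subset is consumed in its original order while the interleaving changes from step to step.

    import random
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # one dataset per measurement; kept separate so a batch never spans subsets
    datasets = [
        TensorDataset(torch.arange(10).float()),
        TensorDataset(torch.arange(1, 11).float() * 10),
        TensorDataset(torch.arange(1, 11).float() * 100),
    ]

    # shuffle=False preserves the temporal order within each measurement
    loaders = [DataLoader(ds, batch_size=8, shuffle=False) for ds in datasets]

    for epoch in range(2):
        iters = [iter(dl) for dl in loaders]
        pending = list(range(len(iters)))  # loaders that still have batches
        while pending:
            # pick a random loader for this training step
            idx = random.choice(pending)
            try:
                batch = next(iters[idx])
            except StopIteration:
                pending.remove(idx)
                continue
            # train_step(batch)  # placeholder for your actual training step
            print(epoch, idx, batch)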

Thank you for the valuable answer, it helps a lot, as I was not aware of how exactly the DataLoader gets the data from the dataset. Now that I know this, I agree with your proposal to create separate datasets and DataLoaders. I am using PyTorch Lightning to train (Trainer class), to which I can pass an argument such as test_dataloaders. Do you have an idea what would be the best way to pass multiple loaders? From what I have seen, it is always recommended to use ConcatDataset.

Thanks again for taking the time to reply!

I’m not familiar enough with Lightning, but it seems test_dataloaders would accept multiple DataLoaders. I’m also unsure why you cannot mix samples from different subsets, but in case it’s a hard requirement, you would need to check how Lightning handles multiple loaders in its Trainer class, and in particular whether it mixes samples.
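In case it helps, this is roughly how passing a list could look; only a sketch, the test_dataloaders keyword is taken from your post, so double check it against your Lightning version, and the model and loader names are placeholders:

    import pytorch_lightning as pl

    # sketch only: the argument name may differ between Lightning versions;
    # model, loader1, loader2, loader3 are placeholders
    trainer = pl.Trainer()
    trainer.test(model, test_dataloaders=[loader1, loader2, loader3])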

Thanks, I will check it out. I cannot mix the datasets because I am modeling a system in the time domain with memory dependencies, and the datasets come from different measurement setups. Data from the first dataset should not end up in the same batch as data from the second, since the data across setups is uncorrelated, while the goal is to learn the correlations within each sequence. Thanks